
Mailing List Archive: Wikipedia: Wikitech

Adding MD5 / SHA1 column to revision table (discussing r94289)

 

 



dvanliere at gmail

Aug 18, 2011, 7:40 AM

Post #1 of 41 (3902 views)
Permalink
Adding MD5 / SHA1 column to revision table (discussing r94289)

Hi!
I am starting this thread because Brion reverted r94289 [0], stating
"core schema change with no discussion" [1].
Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash
column (either MD5 or SHA-1) in the revision table. The primary use
case of this column will be to assist in detecting reverts. I don't
think that data integrity is the primary reason for adding this column.
The huge advantage of having such a column is that it will no longer be
necessary to analyze full dumps to detect reverts; instead, you can
look for reverts in the stub dump file by looking for the same hash
within a single page. The fact that there is a theoretical chance of a
collision is not very important IMHO; it would just mean that in very
rare cases in our research we would flag an edit as reverted when
it's not. The two bug reports contain quite long discussions, and this
feature has also been discussed internally quite extensively, but oddly
enough it hasn't happened yet on the mailing list.
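The detection approach described above can be sketched in Python. The input format here is hypothetical (a real analysis would parse the stub-dump XML), and SHA-1 is assumed for the hash:

```python
# Sketch of hash-based revert detection: within a single page's history,
# a revision whose text hash matches an earlier revision's hash is
# (very likely) a revert back to that earlier state.
import hashlib

def find_identity_reverts(revisions):
    """revisions: iterable of (rev_id, text) for ONE page, in time order.
    Returns the rev_ids whose text hash repeats an earlier hash."""
    seen = {}      # hash -> first rev_id with that text
    reverts = []
    for rev_id, text in revisions:
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if h in seen:
            reverts.append(rev_id)  # same content seen before: a revert
        else:
            seen[h] = rev_id
    return reverts

history = [(1, "foo"), (2, "vandalism"), (3, "foo"), (4, "foo bar")]
print(find_identity_reverts(history))  # [3]
```

With a stored hash column, this scan needs only revision metadata, not the full text of every revision.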

So let's have a discussion!

[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
[3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312

Best,

Diederik

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


erik at wikimedia

Sep 2, 2011, 4:34 PM

Post #2 of 41 (3769 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere <dvanliere [at] gmail> wrote:
> Hi!
> I am starting this thread because Brion's revisionr94289 reverted
> r94289 [0] stating "core schema change with no discussion" [1].

Bumping this: What are the remaining open questions regarding this
schema change? As Diederik explains, this has a number of potential
benefits especially for research/analysis, and I think it makes sense
to target it for 1.19 or 1.20 unless there are scary implications that
haven't been surfaced yet. :-)

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


afeldman at wikimedia

Sep 2, 2011, 5:20 PM

Post #3 of 41 (3770 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

Would it be possible to generate offline hashes for the bulk of our revision
corpus via dumps and load that into prod to minimize the time and impact of
the backfill?

When using this for analysis, will we wish the new columns had partial
indexes (on the first 6 characters, say)?

Has code been written to populate rev_sha1 on each new edit?

On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere <dvanliere [at] gmail>wrote:

> Hi!
> I am starting this thread because Brion's revision r94289 reverted
> r94289 [0] stating "core schema change with no discussion" [1].
> Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash
> column (either md5 or sha1) in the revision table. The primary use
> case of this column will be to assist detecting reverts. I don't think
> that data integrity is the primary reason for adding this column. The
> huge advantage of having such a column is that it will not be longer
> necessary to analyze full dumps to detect reverts, instead you can
> look for reverts in the stub dump file by looking for the same hash
> within a single page. The fact that there is a theoretical chance of a
> collision is not very important IMHO, it would just mean that in very
> rare cases in our research we would flag an edit being reverted while
> it's not. The two bug reports contain quite long discussions and this
> feature has also been discussed internally quite extensively but oddly
> enough it hasn't happened yet on the mailinglist.
>
> So let's have a discussion!
>
> [0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
> [1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
> [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
> [3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
>
> Best,
>
> Diederik
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at nadir-seen-fire

Sep 2, 2011, 5:47 PM

Post #4 of 41 (3770 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

Bug 2939 [1] is one relevant bug here; it could probably use an index.

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

On 11-09-02 05:20 PM, Asher Feldman wrote:
> Would it be possible to generate offline hashes for the bulk of our revision
> corpus via dumps and load that into prod to minimize the time and impact of
> the backfill?
>
> When using for analysis, will we wish the new columns had partial indexes
> (first 6 characters?)
>
> Is code written to populate rev_sha1 on each new edit?
>
> On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere <dvanliere [at] gmail>wrote:
>
>> Hi!
>> I am starting this thread because Brion's revision r94289 reverted
>> r94289 [0] stating "core schema change with no discussion" [1].
>> Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash
>> column (either md5 or sha1) in the revision table. The primary use
>> case of this column will be to assist detecting reverts. I don't think
>> that data integrity is the primary reason for adding this column. The
>> huge advantage of having such a column is that it will not be longer
>> necessary to analyze full dumps to detect reverts, instead you can
>> look for reverts in the stub dump file by looking for the same hash
>> within a single page. The fact that there is a theoretical chance of a
>> collision is not very important IMHO, it would just mean that in very
>> rare cases in our research we would flag an edit being reverted while
>> it's not. The two bug reports contain quite long discussions and this
>> feature has also been discussed internally quite extensively but oddly
>> enough it hasn't happened yet on the mailinglist.
>>
>> So let's have a discussion!
>>
>> [0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
>> [1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
>> [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
>> [3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
>>
>> Best,
>>
>> Diederik
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l [at] lists
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


robla at wikimedia

Sep 2, 2011, 9:33 PM

Post #5 of 41 (3775 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
<lists [at] nadir-seen-fire> wrote:
> On 11-09-02 05:20 PM, Asher Feldman wrote:
>> When using for analysis, will we wish the new columns had partial indexes
>> (first 6 characters?)

> Bug 2939 is one relevant bug to this, it could probably use an index.
> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

I generally suspect that a standard index is going to be a waste for
the most urgent uses of this. It will rarely be interesting to search
for common hashes between articles. The far more common case will be
to search for duplicate hashes within the history of a single article.
My understanding is that having a normal index on a table the size of
our revision table will be far too expensive for db writes.

This is the first I've heard of partial indexes (/me researches). I
don't know if a partial index is going to be cheap enough that we can
use it, and useful enough that we'd want to. Would this be a faster
query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND
rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'

...or would we have to run a query like this to get the benefit of the index?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1 like '4cdbd8%'

...and would either of these queries be considered too expensive to
run without a partial index? How about with a partial index?

Rob

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at nadir-seen-fire

Sep 2, 2011, 10:16 PM

Post #6 of 41 (3768 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On 11-09-02 09:33 PM, Rob Lanphier wrote:
> On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
> <lists [at] nadir-seen-fire> wrote:
>> On 11-09-02 05:20 PM, Asher Feldman wrote:
>>> When using for analysis, will we wish the new columns had partial indexes
>>> (first 6 characters?)
>> Bug 2939 is one relevant bug to this, it could probably use an index.
>> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
> My understanding is that having a normal index on a table the size of
> our revision table will be far too expensive for db writes.
> ...
> Rob
We've got 5 normal indexes on revision:
- A unique int+int
- A binary(14)
- An int+binary(14)
- Another int+binary(14)
- And a varchar(255)+binary(14)

For that bug, a (rev_page, rev_sha1) or (rev_page, rev_timestamp,
rev_sha1) index may do.

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dvanliere at gmail

Sep 3, 2011, 2:13 AM

Post #7 of 41 (3790 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

Hi,

I've suggested generating bulk checksums as well, but both Brion and Ariel see the primary purpose of this field as checking the validity of the dump-generation process, so they want to generate the checksums straight from the external storage.

In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts

I have started to work on a proposal for deployment; while incomplete, it might be a good start for further planning the deployment. I have been trying to come up with some back-of-the-envelope calculations of how much time and space it would take, but I don't yet have all the information required for reasonable estimates.

You can find the proposal here: http://strategy.wikimedia.org/wiki/Proposal:Implement_and_deploy_checksum_revision_table

I want to thank Brion and Asher for giving feedback on prior drafts. Please feel free to improve this proposal.

Best,
Diederik

PS: not sure if this proposal should be on strategy or mediawiki...


On 2011-09-03, at 7:16 AM, Daniel Friesen wrote:

> On 11-09-02 09:33 PM, Rob Lanphier wrote:
>> On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
>> <lists [at] nadir-seen-fire> wrote:
>>> On 11-09-02 05:20 PM, Asher Feldman wrote:
>>>> When using for analysis, will we wish the new columns had partial indexes
>>>> (first 6 characters?)
>>> Bug 2939 is one relevant bug to this, it could probably use an index.
>>> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
>> My understanding is that having a normal index on a table the size of
>> our revision table will be far too expensive for db writes.
>> ...
>> Rob
> We've got 5 normal indexes on revision:
> - A unique int+int
> - A binary(14)
> - An int+binary(14)
> - Another int+binary(14)
> - And a varchar(255)+binary(14)
>
> That bug wise a (rev_page,rev_sha1) or (rev_page,rev_timestamp,rev_sha1)
> may do.
>
> --
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Sep 3, 2011, 2:51 AM

Post #8 of 41 (3756 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Sat, Sep 3, 2011 at 2:20 AM, Asher Feldman <afeldman [at] wikimedia> wrote:
> Is code written to populate rev_sha1 on each new edit?
>
I believe that was part of Aaron's code that got reverted, yes.

Offline generation of hashes is definitely possible, but the only
reason you'd do it is to minimize the time during which some rows have
an empty hash. It's not strictly offline in that the text will still
have to be pulled from ES, but it can be done before the field is
added.

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Sep 3, 2011, 6:03 PM

Post #9 of 41 (3772 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

Diederik van Liere wrote:
> I've suggested to generate bulk checksums as well but both Brion and Ariel see
> the primary purpose of this field to check the validity of the dump generating
> process and so they want to generate the checksums straight from the external
> storage.
>
> [...]
>
> PS: not sure if this proposal should be on strategy or mediawiki...

I think standard practice nowadays is a subpage of
<http://www.mediawiki.org/wiki/Requests_for_comment>.

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


krinklemail at gmail

Sep 4, 2011, 6:29 AM

Post #10 of 41 (3766 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

2011/9/4 MZMcBride <z [at] mzmcbride>

> Diederik van Liere wrote:
> > I've suggested to generate bulk checksums as well but both Brion and
> Ariel see
> > the primary purpose of this field to check the validity of the dump
> generating
> > process and so they want to generate the checksums straight from the
> external
> > storage.
> >
> > [...]
> >
> > PS: not sure if this proposal should be on strategy or mediawiki...
>
> I think standard practice nowadays is a subpage of
> <http://www.mediawiki.org/wiki/Requests_for_comment>.
>
> MZMcBride
>
>
Indeed. Moved:
http://mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text


Krinkle
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ayg at aryeh

Sep 4, 2011, 11:01 AM

Post #11 of 41 (3757 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier <robla [at] wikimedia> wrote:
> I generally suspect that a standard index is going to be a waste for
> the most urgent uses of this.  It will rarely be interesting to search
> for common hashes between articles.  The far more common case will be
> to search for duplicate hashes within the history of a single article.
>  My understanding is that having a normal index on a table the size of
> our revision table will be far too expensive for db writes.

If it's useful for Wikimedia and the index is expensive to maintain,
the index can be kept on only one slave per DB cluster to minimize
cost, the way we've always done for RC. If it's not useful for
Wikimedia, only third-party researchers, then we could leave it
unindexed on the cluster and have an index only on the toolserver or
whatnot.

It's not so much that an extra index would be untenably expensive in
principle, more that it's not worth the cost if it's not used for
anything important. If it's just for statistical analysis and not for
the live site, there's no reason to have it on all servers, except
maybe administrative simplicity. The toolserver already has a bunch
of indexes that Wikimedia doesn't, for exactly this sort of reason.

> This is the first I've heard of partial indexes (/me researches)  I
> don't know if a partial index is going to be cheap enough that we can
> use it, and useful enough that we'd want to.  Would this be a faster
> query in a world with a partial index on the first six characters?
> SELECT rev_id FROM revision WHERE rev_page=12345 AND
> rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'

I'd think so, yes. MySQL should be smart enough to use partial
indexes at least that far -- otherwise there'd be no point in
supporting them. I'd think an index on the first several bytes should
be almost as effective as one on the whole value, if you just want to
do filtering or joining (not sorting). It might be more effective,
since more would fit in RAM. However, I don't know for sure offhand.

> ...and would either of these queries be considered too expensive to
> run without a partial index?  How about with a partial index?

Without a suitable index, running either of these queries would scan
the entire revision history of the article in question. That would
certainly not be acceptable on the cluster if the query is to be run
with any frequency on arbitrary articles. It would be okay if it were
just an occasional thing run on a limited number of articles, or only
on articles with few revisions in their histories.

All of the above should be interpreted with the understanding that I
stand a decent chance of knowing what I'm talking about, but have no
say at this point and shouldn't be taken too seriously. :)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dvanliere at gmail

Sep 4, 2011, 1:23 PM

Post #12 of 41 (3756 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

Thanks for moving the page.
Diederik

On 2011-09-04, at 3:29 PM, Krinkle wrote:

> 2011/9/4 MZMcBride <z [at] mzmcbride>
>
>> Diederik van Liere wrote:
>>> I've suggested to generate bulk checksums as well but both Brion and
>> Ariel see
>>> the primary purpose of this field to check the validity of the dump
>> generating
>>> process and so they want to generate the checksums straight from the
>> external
>>> storage.
>>>
>>> [...]
>>>
>>> PS: not sure if this proposal should be on strategy or mediawiki...
>>
>> I think standard practice nowadays is a subpage of
>> <http://www.mediawiki.org/wiki/Requests_for_comment>.
>>
>> MZMcBride
>>
>>
> Indeed. Moved:
> http://mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text
>
>
> Krinkle
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


mail at tgries

Sep 15, 2011, 11:15 PM

Post #13 of 41 (3715 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

RE:
http://www.mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text#Field_type

Recently, Adding MD5 / SHA1 column to revision table (discussing r94289)
was discussed.

For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums as base-62 character strings instead of
hexadecimal (base-16) strings.

When you need a non-binary, i.e. character, representation for
displaying, storing or transmitting checksums, you should always
consider using base-62 instead of base-16 representations (for MD5 you
need 22 bytes instead of 32 bytes).

When you encode the 128 bits of MD5 (for example) in base-16, aka
hexadecimal, you need CHAR(32). When you use the technique of encoding
the 128 bits in base-62 with characters from [A-Za-z0-9], you'll need
CHAR(22).

See http://de.php.net/manual/en/function.md5.php#83321 for one possible
implementation.

I first found this in Microsoft's free FCIV program, which creates
checksum XML files and uses this shorter representation.
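The length comparison can be illustrated with a short Python sketch. The base-62 encoder below is hand-rolled for illustration (Python has no built-in one); the lengths match the figures above:

```python
# Comparing character representations of a 128-bit MD5 digest:
# hex needs 32 chars, base-36 at most 25, base-62 at most 22,
# and base-64 with padding 24.
import base64
import hashlib
import string

def to_base(n, alphabet):
    """Encode a non-negative integer using the given digit alphabet."""
    digits = []
    while n:
        n, r = divmod(n, len(alphabet))
        digits.append(alphabet[r])
    return "".join(reversed(digits)) or alphabet[0]

digest = hashlib.md5(b"example page text").digest()  # 16 raw bytes
n = int.from_bytes(digest, "big")

hex32 = digest.hex()                                        # 32 chars
b36   = to_base(n, string.digits + string.ascii_lowercase)  # <= 25 chars
b62   = to_base(n, string.digits + string.ascii_letters)    # <= 22 chars
b64   = base64.b64encode(digest).decode()                   # 24 chars

print(len(hex32), len(b36), len(b62), len(b64))
```

A particular digest can encode slightly shorter than the maximum if its leading value is small, which is why the column widths quoted above are the worst-case lengths.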


roan.kattouw at gmail

Sep 16, 2011, 2:24 AM

Post #14 of 41 (3709 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Fri, Sep 16, 2011 at 8:15 AM, Thomas Gries <mail [at] tgries> wrote:
> For some applications, I use the technique of representing the 128 bit
> of md5 or other checksums
>
> as base-62 character strings
> instead of hexadecimal (base-16) strings.
>
MediaWiki already uses a similar technique, storing SHA-1 hashes of
images in base 36.

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


mail at tgries

Sep 16, 2011, 9:48 AM

Post #15 of 41 (3729 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On 16.09.2011 at 11:24, Roan Kattouw wrote:
> For some applications, I use the technique of representing the 128 bit
> of md5 or other checksums
>
> as base-62 character strings
> instead of hexadecimal (base-16) strings.

> MediaWiki already uses a similar technique, storing SHA-1 hashes of
> images in base 36.
Was there a certain reason to choose base 36?
Why not recode to base 62 and save 3 bytes per checksum?

base 16 = CHAR(32) md5 sum in hexadecimal
base 36 = CHAR(25)
base 62 = CHAR(22)

Using base 62 (upper- and lower-case ASCII letters and digits) would
save 3 bytes over base 36, if my calculator works correctly ;-)


neilk at wikimedia

Sep 16, 2011, 10:05 AM

Post #16 of 41 (3714 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On 9/16/11 9:48 AM, Thomas Gries wrote:
> On 16.09.2011 at 11:24, Roan Kattouw wrote:
>> For some applications, I use the technique of representing the 128 bit
>> of md5 or other checksums
>>
>> as base-62 character strings
>> instead of hexadecimal (base-16) strings.
>
>> MediaWiki already uses a similar technique, storing SHA-1 hashes of
>> images in base 36.
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?

11M images in Commons x 3 bytes = 33 MB; at ~1 MB per file on average,
that's a savings equivalent to 33 more files in Commons.

In the time it took you to write this we probably had more than 33 files
uploaded.

--
Neil Kandalgaonkar (| <neilk [at] wikimedia>

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at nadir-seen-fire

Sep 16, 2011, 10:10 AM

Post #17 of 41 (3733 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On 11-09-16 09:48 AM, Thomas Gries wrote:
> On 16.09.2011 at 11:24, Roan Kattouw wrote:
>> For some applications, I use the technique of representing the 128 bit
>> of md5 or other checksums
>>
>> as base-62 character strings
>> instead of hexadecimal (base-16) strings.
>> MediaWiki already uses a similar technique, storing SHA-1 hashes of
>> images in base 36.
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?
>
> base 16 = CHAR(32) md5 sum in hexadecimal
> base 36 = CHAR(25)
> base 62 = CHAR(22)
>
> using base 62 (upper and lower ASCII letters and digits) would save 3
> bytes over base 36.
> if my calculator works correctly ;-)
If we're picking apart checksum sizes in database storage, why not just
go all the way and store the binary data as binary with a BINARY(16)?
16, 36, 62... what happened to base 64, anyway? PHP even has native
code for that.

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


brion at pobox

Sep 16, 2011, 10:19 AM

Post #18 of 41 (3714 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Fri, Sep 16, 2011 at 9:48 AM, Thomas Gries <mail [at] tgries> wrote:

> On 16.09.2011 at 11:24, Roan Kattouw wrote:
> > For some applications, I use the technique of representing the 128 bit
> > of md5 or other checksums
> >
> > as base-62 character strings
> > instead of hexadecimal (base-16) strings.
>
> > MediaWiki already uses a similar technique, storing SHA-1 hashes of
> > images in base 36.
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?
>

This format was chosen for hashes to be used as filenames for uploaded
file storage (currently used only for storing deleted files, I think,
but there has long been a plan to switch primary image storage to this
as well some day).

For greatest compatibility with all filesystems, we only use characters that
are safe (ASCII digits and letters) and don't rely on case distinctions
which are not always preserved (Windows and Mac OS X systems default to
case-insensitive filesystems).

The reason we're not using hex here is that a more compact representation
makes the filenames, and thus any URL references including them in the path,
shorter. On img_sha1 I guess we just kept using it for
compatibility/similarity with the deleted file archives?

-- brion
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Sep 17, 2011, 8:26 AM

Post #19 of 41 (3725 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries <mail [at] tgries> wrote:
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?
>
I don't know; this was way, way before my time. But then, why use base
62 if you can use base 64? Encoders/decoders for that are much more
widely available. PHP's base_convert() will go up to 36, I think, so
you'd have to write a base 62 encoder/decoder yourself, but PHP has
built-in functions to deal with base 64.

> base 16 = CHAR(32) md5 sum in hexadecimal
> base 36 = CHAR(25)
> base 62 = CHAR(22)
>
Minor detail: I think it's more likely we'll use SHA-1 hashes rather
than MD5 hashes.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Platonides at gmail

Sep 17, 2011, 2:52 PM

Post #20 of 41 (3701 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

Roan Kattouw wrote:
> On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries<mail [at] tgries> wrote:
>> Was there a certain reason to chose base 36 ?
>> Why not recoding to base 62 and saving 3 bytes per checksum ?
>>
> I don't know, this was way, way before my time. But then, why use base
> 62 if you can use base 64? Encoders/decoders for that are much more
> widely available. PHP's base_convert() will go up to 36, I think, so
> you'd have to write a base 62 encoder/decoder yourself, but PHP has
> built-in functions to deal with base 64.

Brion explained it on previous mail. And yes, we have our own converter
in MediaWiki.




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rarohde at gmail

Sep 17, 2011, 3:46 PM

Post #21 of 41 (3715 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Sat, Sep 17, 2011 at 8:26 AM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
> Minor detail: I think it's more likely we'll use SHA-1 hashes rather
> than MD5 hashes.

Is there a good reason to prefer SHA-1?

Both have weaknesses allowing one to construct a collision (with
considerable effort), but I wouldn't see why that would matter for the
proposed use.

With only about 1 billion revisions in the collective databases, the
odds of an accidental collision with either MD5 or SHA-1 is
infinitesimal (less than 1 in 10^18 for the weaker MD5).

MD5 is shorter and in my experience about 25% faster to compute.

Personally I've tended to view MD5 as more than good enough in offline analyses.

-Robert Rohde
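The accidental-collision claim above can be checked back-of-the-envelope with the standard birthday approximation p ~= n^2 / 2^(b+1) for n items hashed into b bits (a rough estimate, not an exact probability):

```python
# Birthday-bound estimate of an accidental hash collision across
# all revisions, for 128-bit MD5 and 160-bit SHA-1.
n = 1e9  # roughly a billion revisions across the collective databases

p_md5 = n * n / 2 ** 129    # 128-bit MD5
p_sha1 = n * n / 2 ** 161   # 160-bit SHA-1

print(f"MD5:   {p_md5:.1e}")    # ~1.5e-21, well under 1 in 10^18
print(f"SHA-1: {p_sha1:.1e}")
```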

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Sep 17, 2011, 4:56 PM

Post #22 of 41 (3705 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde <rarohde [at] gmail> wrote:
> Is there a good reason to prefer SHA-1?
>
> Both have weaknesses allowing one to construct a collision (with
> considerable effort)

Considerable effort? I can create an MD5 collision in a few minutes
on my home computer. Is there anything even remotely like this for
SHA-1?

> MD5 is shorter and in my experience about 25% faster to compute.
>
> Personally I've tended to view MD5 as more than good enough in offline analyses.

For offline analyses, there's no need to change the online database tables.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rarohde at gmail

Sep 17, 2011, 10:55 PM

Post #23 of 41 (3713 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Sat, Sep 17, 2011 at 4:56 PM, Anthony <wikimail [at] inbox> wrote:
> On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde <rarohde [at] gmail> wrote:
>> Is there a good reason to prefer SHA-1?
>>
>> Both have weaknesses allowing one to construct a collision (with
>> considerable effort)
>
> Considerable effort? I can create an MD5 collision in a few minutes
> on my home computer. Is there anything even remotely like this for
> SHA-1?

If I've been keeping up to date, the collision complexity for MD5 is
about 2^21 operations, and runs in a few seconds (not minutes); and
for SHA-1 down to about 2^52 with current results. The latter
represents about 100 cpu-years, which is within the realm of
supercomputers. That time will probably continue to come down if
people find ways to improve the attacks on SHA-1. (The existing
attacks usually require the ability to feed arbitrary binary strings
into the hash function. Given that both browsers and MediaWiki will
tend to reject binary data placed in an edit window, I'm not sure if
any of the existing attacks could be reliably applied to MediaWiki
editing.)

If collision attacks really matter, we should use SHA-1. However, do
any of the proposed use cases care about whether someone might
intentionally inject a collision? In the proposed uses I've looked at,
it seems irrelevant. An intentional collision would get flagged as a
revert, and the text leading to that collision would be discarded.
How is that a bad thing?

It's not a big deal, but if I understand prior comments correctly,
most of the existing offline infrastructure uses MD5, so I'm wondering
if there is a distinct use case favoring SHA-1.

>> MD5 is shorter and in my experience about 25% faster to compute.
>>
>> Personally I've tended to view MD5 as more than good enough in offline analyses.
>
> For offline analyses, there's no need to change the online database tables.

Need? That's debatable, but one of the major motivators is the desire
to have hash values in database dumps (both for revert checks and for
checksums on correct data import / export). Both of those are
"offline" uses, but it is beneficial to have that information
precomputed and stored rather than frequently regenerated.

-Robert Rohde

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Sep 17, 2011, 11:33 PM

Post #24 of 41 (3717 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

On Sat, 17-09-2011 at 22:55 -0700, Robert Rohde
wrote:
> On Sat, Sep 17, 2011 at 4:56 PM, Anthony <wikimail [at] inbox> wrote:

<snip>

> > For offline analyses, there's no need to change the online database tables.
>
> Need? That's debatable, but one of the major motivators is the desire
> to have hash values in database dumps (both for revert checks and for
> checksums on correct data import / export). Both of those are
> "offline" uses, but it is beneficial to have that information
> precomputed and stored rather than frequently regenerated.

If we don't have it in the online database tables, that defeats the
purpose of having the value at all, as far as generating the XML dumps
is concerned.

Recall that the dumps are generated in two passes; in the first pass we
retrieve from the db and record all of the metadata about revisions, and
in the second (time-consuming) pass we re-use the text of the revisions
from a previous dump file if the text is in there. We want to compare
the hash of that text against what the online database says the hash is;
if they don't match, we want to fetch the live copy.
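
(A minimal sketch of that second-pass check -- function and parameter
names are hypothetical, and the real dump scripts are considerably more
involved; hex digests are used here purely for illustration:)

```python
import hashlib

def can_reuse_text(stored_sha1, prev_dump_text):
    """Decide whether text recycled from a previous dump is trustworthy.

    stored_sha1: hex digest recorded in the online revision table.
    prev_dump_text: the text found in the earlier dump, or None if the
    revision wasn't present there.
    Returns True when the hashes match; False means the live copy must
    be fetched from the database instead.
    """
    if prev_dump_text is None:
        return False
    digest = hashlib.sha1(prev_dump_text.encode("utf-8")).hexdigest()
    return digest == stored_sha1
```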

I refer folks to bug 23264 [1] as proof that mismatch between the
metadata and the text has crept in in the past; changes to MW code in
other places than the backups scripts can cause quite subtle breakage.

Ariel

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=23264



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rnnelson at clarkson

Sep 18, 2011, 4:24 AM

Post #25 of 41 (3766 views)
Permalink
Re: Adding MD5 / SHA1 column to revision table (discussing r94289) [In reply to]

As Robert says, it is meaningless to talk about cryptography without a threat model. Is anybody actually attacking us, or are we only worried about accidental collisions?

Sent from my Verizon Wireless Phone


-----Original message-----
From: Robert Rohde <rarohde [at] gmail>
To: Wikimedia developers <wikitech-l [at] lists>
Sent: Sun, Sep 18, 2011 05:56:15 GMT+00:00
Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

<snip>
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.