#113381 - 27/08/2002 20:14
jEmplode 42 pre 5
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
http://www.jempeg.org/jemplode20-stonecutters.jar
What's New:
1) Added CTIME tag support
2) Fixed copying from search soups that talk about "refs" (pasting would increment the refs and remove the node from the selection causing all the others to be off by one)
3) Transient soups colorize
4) Fixed a weird problem with the refs soup when importing new tunes: just-imported tunes would get stuck into the refs = 0 soup for an instant, but just long enough to get synced. It would get cleaned up automagically the next time, but it turned the refs = 0 soup red. Kind of annoying. It's fixed.
5) Database-wide deduping on import. This defaults to on, but you can turn it off in options. Basically, I compute a CRC32 on the data (i.e. non-tag) portion of the tune and append the length of that section to the CRC and use that as the hash for the tune. This hash is stored in a new tag for imported tunes called "hash". When a database is downloaded, I keep a Hashtable of Hash=>Node mappings so I can look up a node at import time. This uses up more RAM (maybe 30 or so bytes per hashed tune) at jEmplode runtime and you take a performance hit at import time to compute the hash. I tried MD5 originally but WOW was it slow. I'm very curious to hear how this works for people and if they get false positives for duplicates.
6) Because of #5, mcomb's wish to have soup imports dedupe comes true.
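A minimal sketch of the hashing scheme from item 5, assuming the data (non-tag) portion of the tune is already available as a byte array. The class and method names, the "-" separator, and the use of a plain String as the node are hypothetical and not taken from jEmplode's source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

final class TuneHashSketch {
    // hash -> node name; stands in for the Hash=>Node table described above
    private final Map<String, String> nodesByHash = new HashMap<>();

    /** Hex CRC32 of the data bytes, followed by the length of the data. */
    static String hash(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return Long.toHexString(crc.getValue()) + "-" + data.length;
    }

    /** Returns true if the tune was imported, false if it was skipped as a duplicate. */
    boolean importTune(String nodeName, byte[] data) {
        String h = hash(data);
        if (nodesByHash.containsKey(h)) {
            return false; // same CRC and same length: treat as already imported
        }
        nodesByHash.put(h, nodeName);
        return true;
    }
}
```

Two different tunes collide only if both the CRC and the data length match, which is why the length is appended.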
Known Issues:
Still not checking for loops when you paste a playlist. For some reason I can't bring myself to actually focus and write this code ... Oh well.
I want to get the loop check in, and hear back from people about problems, then I'm calling this 42 and will likely also do an official release.
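For illustration, one way the missing loop check could look, assuming playlists form a tree of nodes with child lists. The Node class and method names are hypothetical; this is not jEmplode's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PlaylistLoopSketch {
    /** Hypothetical playlist node: a name plus child playlists or tunes. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** True if pasting 'pasted' under 'target' would make a playlist contain itself. */
    static boolean wouldCreateLoop(Node target, Node pasted) {
        return reaches(pasted, target, new HashSet<>());
    }

    // Depth-first search from 'from'; the 'seen' set keeps shared children from being revisited.
    private static boolean reaches(Node from, Node goal, Set<Node> seen) {
        if (from == goal) {
            return true;
        }
        if (!seen.add(from)) {
            return false;
        }
        for (Node child : from.children) {
            if (reaches(child, goal, seen)) {
                return true;
            }
        }
        return false;
    }
}
```

The paste is unsafe exactly when the target is the pasted playlist itself or one of its descendants.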
Mike
#113383 - 27/08/2002 20:24
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Basically, I compute a CRC32 on the data ... I tried MD5 originally but WOW was it slow.
You might want to try Adler32. It should be faster than the CRC and it's built into Java (at least 1.3 and 1.4).
_________________________
Bitt Faulk
#113384 - 27/08/2002 20:43
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 23/09/2000
Posts: 3608
Loc: Minnetonka, MN
|
OK, here's the soup I am trying to build. If someone could let me know whether I can do it, and how, that would be great, because I can't figure it out.
Artists
......0-E
.........Albums
..............Tracks
......F-J
.........Albums
..............Tracks
......K-O
.........Albums
..............Tracks
......P-T
.........Albums
..............Tracks
......U-Z
.........Albums
..............Tracks
Thanks again Mike for making all this cool stuff
_________________________
Matt
#113385 - 27/08/2002 22:37
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 24/01/2002
Posts: 3937
Loc: Providence, RI
|
and you take a performance hit at import time to compute the hash.
And it's very noticeable. I have all my tunes in a distributed filesystem, and I figured... eh, sucking them across a wireless network and then cramming them back is sucking like usual. So I switched to wired, and it was still slower than before (usually I remember to just do it wired right along).
Given that I usually batch imports and then go to sleep, I'll probably disable it, but it's a neat idea.
#113386 - 28/08/2002 05:37
Re: jEmplode 42 pre 5
[Re: msaeger]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
Layer 1: Tag Range on Artist 0-E,F-J,K-O,P-T,U-Z
Layer 2: Tag By Artist (If you want the actual name of the artist as a layer -- leave this out if you just want to segment by Artist first letter)
Layer 3: Tag By Album
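To make the Layer 1 grouping concrete, here is a rough sketch of how a range spec like the one above might map an artist to a bucket by first character. It assumes a simple inclusive character comparison and a non-null artist; how jEmplode's Tag Range layer actually parses and matches ranges is not shown in this thread, so the names and behavior here are hypothetical.

```java
final class TagRangeSketch {
    /**
     * Returns the range (e.g. "F-J") whose bounds include the artist's first
     * character, or null if the artist is empty or no range matches.
     */
    static String bucketFor(String artist, String rangeSpec) {
        if (artist.isEmpty()) {
            return null;
        }
        char first = Character.toUpperCase(artist.charAt(0));
        for (String range : rangeSpec.split(",")) {
            String r = range.trim();
            char lo = r.charAt(0);              // first character of "0-E" is '0'
            char hi = r.charAt(r.length() - 1); // last character of "0-E" is 'E'
            if (first >= lo && first <= hi) {
                return r;
            }
        }
        return null;
    }
}
```

With the spec "0-E,F-J,K-O,P-T,U-Z", artists starting with a digit land in the first bucket because digits sort between '0' and 'E'.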
Mike
#113388 - 28/08/2002 05:45
Re: jEmplode 42 pre 5
[Re: wfaulk]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
Oh cool -- I'll check it out ...
[moments later]
On the same 16M file, I get the following results:
CRC32 = 1211ms
MD5 = 2644ms
ADLER32 = 3826ms
I'm going to look more into it when I get home though.
#113389 - 28/08/2002 10:48
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Ouch. The whole point of Adler32 is that it's supposed to be faster. Maybe someone has a better implementation than the one Java provides.
Edit: I realize that it's not Java, but these benchmarks show Adler32 running 2-5 times as fast as CRC32, which itself is 1.5-4 times as fast as MD5.
Edited by wfaulk (28/08/2002 10:54)
_________________________
Bitt Faulk
#113390 - 28/08/2002 11:02
Re: jEmplode 42 pre 5
[Re: wfaulk]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
I suppose so much of it is implementation ... I'll look for an optimized implementation. The CRC32 impl I'm using is the one I ported from emptool. Not sure how it stacks up implementation wise, but it seems pretty quick ...
#113391 - 28/08/2002 11:25
Re: jEmplode 42 pre 5
[Re: wfaulk]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
Just out of curiosity, I benchmarked JDK 1.4's Adler32 against JDK 1.4's CRC32. I ran it 3 times on the same 5 meg random dataset. Very strange ... Sun's Adler32 implementation must suck. Numbers are in millis.
java.util.zip.CRC32: 581
java.util.zip.Adler32: 1001
java.util.zip.CRC32: 591
java.util.zip.Adler32: 1002
java.util.zip.CRC32: 630
java.util.zip.Adler32: 1002
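A comparison along these lines can be reproduced with a small harness like the following. This is a sketch, not the code that produced the numbers above: the pass counts, the warm-up, and the fixed random seed are assumptions, and results will differ by JVM and machine.

```java
import java.util.Random;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

final class ChecksumBenchSketch {
    /** Runs the checksum over the whole buffer 'passes' times; returns elapsed nanoseconds. */
    static long timeNanos(Checksum sum, byte[] data, int passes) {
        long start = System.nanoTime();
        for (int i = 0; i < passes; i++) {
            sum.reset();
            sum.update(data, 0, data.length);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] data = new byte[5 * 1024 * 1024]; // 5 MB of random bytes
        new Random(42).nextBytes(data);

        // Warm-up passes so one-time JVM costs are not charged to either algorithm.
        timeNanos(new CRC32(), data, 3);
        timeNanos(new Adler32(), data, 3);

        System.out.println("java.util.zip.CRC32:   "
                + timeNanos(new CRC32(), data, 10) / 1_000_000 + " ms");
        System.out.println("java.util.zip.Adler32: "
                + timeNanos(new Adler32(), data, 10) / 1_000_000 + " ms");
    }
}
```

Both classes implement java.util.zip.Checksum, so the same timing loop covers either one.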
#113392 - 28/08/2002 11:52
Re: jEmplode 42 pre 5
[Re: mschrag]
|
pooh-bah
Registered: 31/08/1999
Posts: 1649
Loc: San Carlos, CA
|
6) Because of #5, mcomb's wish to have soup imports dedupe comes true.
Sweet. Thanks Mike. For the speed issue, do you really need to compute a checksum for the entire file? Wouldn't a checksum for the first 10 or 20k be just as likely to produce a unique identifier?
-Mike
#113393 - 28/08/2002 11:56
Re: jEmplode 42 pre 5
[Re: mcomb]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
I thought about this one ... I may just do that. I would imagine after a certain amount of data, if it's not already unique, it's not gonna be. I'll try 20k and see how it goes. I probably need to have a UI for dealing with collisions too, but since I combine the length of the data portion with the CRC, I am hoping collisions will be very unlikely.
I'll try this tonight...
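The 20k idea could be sketched as below, assuming the data portion is in a byte array and that the full data length is still appended to the CRC. The class name, constant, and hash format are hypothetical.

```java
import java.util.zip.CRC32;

final class PrefixHashSketch {
    static final int PREFIX_BYTES = 20 * 1024; // only the first 20k is checksummed

    /** Hex CRC32 of at most the first 20k of data, followed by the full data length. */
    static String hash(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, Math.min(data.length, PREFIX_BYTES));
        return Long.toHexString(crc.getValue()) + "-" + data.length;
    }
}
```

The trade-off is that two files of equal length that differ only after the first 20k produce the same hash.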
#113394 - 28/08/2002 12:03
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
Or maybe the first 10k plus the last 10k?
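As a sketch of that variant, assuming a 10k window at each end and the same CRC-plus-length format discussed earlier in the thread (names and format are hypothetical):

```java
import java.util.zip.CRC32;

final class HeadTailHashSketch {
    static final int WINDOW = 10 * 1024; // bytes taken from each end

    /** Hex CRC32 over the first and last 10k, followed by the full data length. */
    static String hash(byte[] data) {
        CRC32 crc = new CRC32();
        if (data.length <= 2 * WINDOW) {
            // The windows would overlap, so just checksum everything once.
            crc.update(data, 0, data.length);
        } else {
            crc.update(data, 0, WINDOW);
            crc.update(data, data.length - WINDOW, WINDOW);
        }
        return Long.toHexString(crc.getValue()) + "-" + data.length;
    }
}
```

Bytes between the two windows never reach the checksum, so a change confined to the middle of a file of unchanged length goes undetected.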
#113395 - 28/08/2002 12:04
Re: jEmplode 42 pre 5
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
Then again I dunno. In theory, a re-rip might have the same data at the beginning and end if you were trying to correct an error in the middle... And the whole point of doing the CRC checking is so that it correctly recognizes things like re-rips of the same tune and uploads them... right?
#113396 - 28/08/2002 12:07
Re: jEmplode 42 pre 5
[Re: tfabris]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
Well -- this is an interesting question. Some people would say that a rerip is the same song, some would say it's not. The odds of both the CRC _and_ the data-portion length being identical in two different rips are probably pretty slim though, so it would currently upload again.
#113397 - 28/08/2002 12:23
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
Some people would say that a rerip is the same song, some would say it's not.
I don't see what the point of CRC checking would be, other than to make sure to detect re-rips on your PC and replace the tunes if they had really changed.
#113398 - 28/08/2002 12:36
Re: jEmplode 42 pre 5
[Re: tfabris]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
The benefit of the CRC checking from my perspective was that you can just drop any tunes in the soup and not have to worry if you've already uploaded them before.
To be able to identify a rerip, you really need a hash function that is designed for music (like a soundprint). Right now we're looking strictly at file contents.
So say you drop a tune on and the hashes match (i.e. you already have this tune on your Empeg). Right now it skips that file as a duplicate -- are you saying you want the old tune to be removed and the new one to take its place instead?
#113399 - 28/08/2002 12:43
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
Right now it skips that file as a duplicate -- are you saying you want the old tune to be removed and the new one to take its place instead?
Exactly. Why else would I have a different version of the same song on my PC unless I was fixing a bad rip?
For tune-duplicate-checking, the tag information combined with file size is plenty. But if the actual bits are different, then you know you re-ripped the tune.
When you talk about doing soundprint signatures, that would identify the song, but not whether it was a re-rip. Two different rips would appear (correctly) to be the same to a soundprint signature.
And what about not just re-ripping, but upgrading the bit rate too? Say I've got an album at 128 on my player and on my PC. Then I re-rip at 256 and copy over those files on the PC, then drop them onto the soup. It should say, "ah, cool, this is a new rip" and replace all the FIDs.
#113400 - 28/08/2002 12:47
Re: jEmplode 42 pre 5
[Re: tfabris]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
The only way to have different data bits but be able to identify duplicates is a soundprint, though. Unless there's a cool open source soundprint algorithm, I don't know that I'll ever be able to have two rips that I can identify as the same tune.
I think you need both checks, actually. If the strict-data hash matches, then you are just dropping the exact same song in, and that _should_ be skipped ("Oh -- I see you already have this exact song on your Empeg, I won't put it on a second time"). If not, then you would soundprint them and if the soundprint matches then you know it is a different version of the same song ("Oh -- this is the same song, but the first match failed so I know the bytes are different -- I need to replace the tune").
Anyone know of a soundprint algorithm?
#113401 - 28/08/2002 12:51
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
For the basic purpose you originally stated... Dropping a folder onto the player and not having it add the old tunes twice... wouldn't the file size and tag data be enough of a check to ID it as the same song?
Hmm, but then that wouldn't cover the bitrate-upgrade situation... And it wouldn't cover poor tagging practices... Hmm the soundprint stuff is looking better all the time, isn't it...
Didn't someone here on the BBS actually work on a soundprint application that went opensource, and we did a whole thread on it?
#113402 - 28/08/2002 12:54
Re: jEmplode 42 pre 5
[Re: tfabris]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
I thought so too .. I thought it was actually called "soundprint".
By the way -- the reason I can't trust tags is that you can change the tags outside of jEmplode but it's still the same song. So I have to create a hash of only the music portion of the file and store that with the tune on the Empeg. That way when you drop a tune again, I can hash the tune you just dropped and look for a match. But you're right -- the bitrate change would break that.
I'll go hunting for a soundprint...
#113403 - 28/08/2002 12:56
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
No, it was called something else, but I don't remember the name. Arg.
#113404 - 28/08/2002 12:57
Re: jEmplode 42 pre 5
[Re: tfabris]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
#113405 - 28/08/2002 12:58
Re: jEmplode 42 pre 5
[Re: mschrag]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31594
Loc: Seattle, WA
|
#113406 - 28/08/2002 13:44
Re: jEmplode 42 pre 5
[Re: tfabris]
|
carpal tunnel
Registered: 24/01/2002
Posts: 3937
Loc: Providence, RI
|
For tune-duplicate-checking, the tag information combined with file size is plenty.
I upload files I get from someone. A few days later I upload the same files with tags that are fixed with better/additional/correct data. They're still the same songs. I probably want an "option" to replace, but that might be too much work. I think what I don't want is both "songs".
#113407 - 28/08/2002 13:57
Re: jEmplode 42 pre 5
[Re: Daria]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
Yeah -- There are really four scenarios here:
1) the tags are the same, the song is the same -- always should be rejected as a duplicate
2) the tags are different, the song is the same -- maybe we should reimport just the tags?
3) the song is different (but sounds the same) -- replace the song on the empeg
4) the song is different (and sounds different) -- add a new song
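The four scenarios can be written down as a small decision function. This is only a sketch: it assumes three yes/no comparisons are already available (data hash match, tag match, soundprint match), and the enum and method names are hypothetical.

```java
final class ImportDecisionSketch {
    enum Action { SKIP_DUPLICATE, REIMPORT_TAGS, REPLACE_TUNE, ADD_NEW_TUNE }

    /**
     * sameData:  the strict data hash (CRC plus length) matches an existing tune
     * sameTags:  the tags match that tune's tags
     * sameSound: a soundprint comparison says the two sound the same
     */
    static Action decide(boolean sameData, boolean sameTags, boolean sameSound) {
        if (sameData) {
            // Scenarios 1 and 2: identical bytes, so only the tags can differ.
            return sameTags ? Action.SKIP_DUPLICATE : Action.REIMPORT_TAGS;
        }
        // Scenarios 3 and 4: the bytes differ, so fall back to the soundprint.
        return sameSound ? Action.REPLACE_TUNE : Action.ADD_NEW_TUNE;
    }
}
```

The cheap byte-level hash is checked first, so the soundprint would only need to be computed when that hash does not match.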
It looks like the code for songprint is only about 150 lines to compute the signature, but it uses FFTW (the "Fastest Fourier Transform in the West" library), which I'm sure is rather large. I need to either port FFTW to Java, find a replacement, or use the Java wrappers and require the dll or .so (which would suck).
ms
#113408 - 28/08/2002 13:58
Re: jEmplode 42 pre 5
[Re: tfabris]
|
pooh-bah
Registered: 09/09/2000
Posts: 2303
Loc: Richmond, VA
|
The only issue here is that I'm just CRCing the file and it slows things down ... How long is it going to take to compute a soundprint I wonder?
#113409 - 28/08/2002 14:38
Re: jEmplode 42 pre 5
[Re: mschrag]
|
stranger
Registered: 22/02/2002
Posts: 55
Loc: Rignt here
|
Maybe a check-box that lets you decide how you want this situation handled: skip, replace, or ask with each instance.
Sorry if this has already been offered...I haven't read through all posts.
_________________________
:: john
(24GB Mk2)