Pythonistas: Up for a quick hack to test dedup’ing 78rpm records using document clustering?

I think a one-day exploration would be enough to figure out whether this will work, but it is beyond my Python ability.

Idea: OCR the labels of our 78rpm records, then take an image of a new 78rpm record, OCR it the same way, and list the existing records whose label text is closest to it. I would think this could be done with a search engine or with document vectors (gensim). On a Mac yesterday I downloaded the images and found a lead on trusty Stack Exchange:

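# Install the Internet Archive command-line tool and the Tesseract OCR engine
# (the commands below also assume GNU parallel is installed, e.g. brew install parallel)
pip3 install internetarchive
brew install tesseract
# Get the images of the labels (there are 350k of them, but we can test with 1000)
ia search "collection:georgeblood" --itemlist | head -1000 | parallel -j10 'ia download {} --no-directories --format="Item Image"'
# OCR each image; tesseract appends .txt to the output name, so x.jpg produces x.jpg.txt
ls -1 *.jpg | parallel 'tesseract {} {}'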

# Then compare the OCR text files with something like gensim, e.g. https://stackoverflow.com/questions/42781292/doc2vec-get-most-similar-documents
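
To make the "list the ones that are close to it" step concrete, here is a minimal sketch that uses only the Python standard library. It is a stand-in for gensim, not the gensim approach itself: it scores labels by TF-IDF cosine similarity over character trigrams, on the assumption that trigrams tolerate OCR misreads better than whole words. The label texts and item names are made up for illustration, and `trigrams`, `build_index`, and `most_similar` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def trigrams(text):
    """Lowercase, squash punctuation to spaces, and count character trigrams."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def normalize(vec):
    """Scale a sparse vector to unit length so a dot product is a cosine."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def build_index(docs):
    """docs maps item name -> OCR text. Returns (idf weights, TF-IDF vectors)."""
    counts = {name: trigrams(text) for name, text in docs.items()}
    doc_freq = Counter(g for c in counts.values() for g in c)
    n = len(docs)
    idf = {g: math.log((1 + n) / (1 + d)) + 1 for g, d in doc_freq.items()}
    vectors = {
        name: normalize({g: tf * idf[g] for g, tf in c.items()})
        for name, c in counts.items()
    }
    return idf, vectors


def most_similar(query_text, idf, vectors, top_n=3):
    """Return up to top_n (item name, cosine score) pairs, best match first."""
    query = normalize(
        {g: tf * idf[g] for g, tf in trigrams(query_text).items() if g in idf}
    )
    scored = [
        (name, sum(w * vec.get(g, 0.0) for g, w in query.items()))
        for name, vec in vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up OCR output standing in for the .txt files tesseract writes.
labels = {
    "item_a": "ACME RECORDS 1234 Moonlight Waltz The Example Orchestra",
    "item_b": "ACME RECORDS 5678 Harbor Blues Sample Jazz Quartet",
    "item_c": "NORTHERN PHONOGRAPH 42 Spring Polka Village Brass Band",
}
idf, vectors = build_index(labels)

# A noisy OCR reading of a "new" copy of item_a.
new_label = "ACNE REC0RDS 1234 Moonlite Waltz The Exarnple Orchestra"
for name, score in most_similar(new_label, idf, vectors):
    print(f"{name}\t{score:.3f}")
```

In a real test the `labels` dictionary would be filled by reading the OCR text files, and the scoring could be swapped for gensim document vectors once the basic idea is shown to work.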


Anyone up for helping test this theory? Again, I am thinking this is a one-day hack. If it works, it will need tuning, and I would hope the Archive could sponsor that phase.

We have many duplicates in the collection already, so testing this could be easy.
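
One way to use those known duplicates as a test, sketched under assumptions: for each known duplicate pair, check whether the second record shows up among the top k matches for the first. The texts and pair list below are made up, `rank_by_similarity` and `recall_at_k` are hypothetical helper names, and the similarity measure is the standard library's `difflib.SequenceMatcher`, chosen only to keep the sketch self-contained.

```python
from difflib import SequenceMatcher


def rank_by_similarity(query_id, texts):
    """Return all other item ids, most similar OCR text first."""
    query = texts[query_id].lower()
    others = [item for item in texts if item != query_id]
    return sorted(
        others,
        key=lambda item: SequenceMatcher(None, query, texts[item].lower()).ratio(),
        reverse=True,
    )


def recall_at_k(known_pairs, texts, k=1):
    """Fraction of known duplicate pairs (a, b) where b is in a's top k matches."""
    hits = sum(1 for a, b in known_pairs if b in rank_by_similarity(a, texts)[:k])
    return hits / len(known_pairs)


# Made-up OCR texts: each *_dup entry is a noisy reading of the same label.
texts = {
    "rec1": "ACME RECORDS 1234 Moonlight Waltz The Example Orchestra",
    "rec1_dup": "ACNE REC0RDS 1234 Moonlite Waltz The Exarnple Orchestra",
    "rec2": "NORTHERN PHONOGRAPH 42 Spring Polka Village Brass Band",
    "rec2_dup": "N0RTHERN PHONOGRAFH 42 Spring Polka Vilage Brass Band",
}
known_pairs = [("rec1", "rec1_dup"), ("rec2", "rec2_dup")]

print("recall@1:", recall_at_k(known_pairs, texts, k=1))
```

A low score on real labels would point at either the OCR quality or the similarity measure, which is where the tuning phase would start.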


One Response to Pythonistas: Up for a quick hack to test dedup’ing 78rpm records using document clustering?

  1. art rhyno says:

    I am hardly a Pythonista but I think this is a promising approach. I put my tinkering here:
    https://github.com/artunit/identSim
    and the gensim steps are well described here:
    https://dev.to/thepylot/compare-documents-similarity-using-python-nlp-4odp
    I struggled more with the OCR bits. I haven’t worked a lot with materials that have the colour variations and text layouts found on record labels, and I found that the default Tesseract processing needed some boosting. But if the text is strong enough, I think this has great potential.
