Pythonistas: Up for quick hack to test Dedup’ing 78rpm records using document clustering?

I think this could be a one-day exploration, at least to figure out whether it will work, but it is beyond my Python ability.

Idea: OCR the labels of our 78rpm records, then take an image of a new 78rpm record and list the existing records whose label text is closest to it. I would think this could be done with a search engine, or with document vectors (gensim). On a Mac yesterday I got the images and a lead from trusty Stack Exchange:

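pip3 install internetarchive
# tesseract does the OCR; GNU parallel is needed for the two pipelines below
brew install tesseract parallel
# get the images of the labels (there are 350k of them, but can test with 1000)
ia search "collection:georgeblood" --itemlist | head -1000 | parallel -j10 'ia download {} --no-directories --format="Item Image"'
# OCR each label; tesseract appends .txt to the output name, so foo.jpg becomes foo.jpg.txt
ls -1 *.jpg | parallel 'tesseract {} {}'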

# then something like gensim e.g. https://stackoverflow.com/questions/42781292/doc2vec-get-most-similar-documents
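
To make the "list the ones that are close to it" step concrete, here is a minimal sketch of what the similarity search could look like. It is not gensim and not doc2vec: it is a plain standard-library stand-in that uses TF-IDF over character trigrams with cosine similarity, on the assumption that character chunks tolerate OCR misreads better than whole words. The function names (`trigrams`, `build_index`, `most_similar`, `load_ocr_texts`) are made up for illustration, and the file layout assumes tesseract wrote its `.txt` files into the current folder as in the commands above.

```python
import math
from collections import Counter
from pathlib import Path


def trigrams(text):
    """Lowercase, collapse whitespace, and return overlapping 3-character chunks."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def normalize(counts, idf):
    """Weight raw counts by IDF and scale to unit length (unseen trigrams are dropped)."""
    vec = {g: tf * idf[g] for g, tf in counts.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def build_index(docs):
    """docs maps a label name to its OCR text. Returns (vectors, idf)."""
    counts = {name: Counter(trigrams(text)) for name, text in docs.items()}
    # Document frequency: in how many labels does each trigram appear?
    df = Counter(gram for c in counts.values() for gram in c)
    n = len(docs)
    idf = {gram: math.log((1 + n) / (1 + d)) + 1.0 for gram, d in df.items()}
    vectors = {name: normalize(c, idf) for name, c in counts.items()}
    return vectors, idf


def most_similar(query_text, index, idf, top_n=5):
    """Rank indexed labels by cosine similarity to the query label's OCR text."""
    q = normalize(Counter(trigrams(query_text)), idf)
    scores = {
        name: sum(w * vec.get(g, 0.0) for g, w in q.items())
        for name, vec in index.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


def load_ocr_texts(folder="."):
    """Read every *.txt file in a folder (assumed to be tesseract output)."""
    return {p.name: p.read_text(errors="ignore") for p in Path(folder).glob("*.txt")}
```

Something like `index, idf = build_index(load_ocr_texts("."))` followed by `most_similar(new_label_text, index, idf)` would then return the closest existing labels with their scores. Whether those scores separate true duplicates from near misses is exactly what the one-day test would need to find out.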


Anyone up for helping test this theory? Again, I am thinking this is a one-day hack. If it works, it will then take tuning and such, and the Archive, I would hope, could sponsor that phase.

We have many duplicates in the collection already, so testing this could be easy.
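
One way that test could be scored, sketched under assumptions: given a hand-made list of known duplicate pairs and, for each record, a ranked list of its nearest neighbours from whatever similarity method is being tried, count how often a record's known duplicate shows up in its top k. The function name `hit_rate_at_k`, the input shapes, and the item names in the example are all hypothetical.

```python
def hit_rate_at_k(rankings, known_duplicates, k=5):
    """Fraction of known-duplicate lookups where the partner is in the top k.

    rankings: {item: [neighbour names, best match first]}
    known_duplicates: iterable of (item_a, item_b) pairs known to be the same record
    """
    hits = 0
    total = 0
    for a, b in known_duplicates:
        # Check the pair in both directions: a should find b, and b should find a.
        for query, partner in ((a, b), (b, a)):
            if query not in rankings:
                continue  # no ranking was computed for this item
            total += 1
            # Drop the query itself, which usually ranks first against its own index.
            neighbours = [name for name in rankings[query] if name != query][:k]
            hits += partner in neighbours
    return hits / total if total else 0.0


# Toy example with made-up item names.
example_rankings = {
    "rec_001": ["rec_001", "rec_002", "rec_050"],
    "rec_002": ["rec_002", "rec_050", "rec_001"],
}
print(hit_rate_at_k(example_rankings, [("rec_001", "rec_002")], k=1))
```

A low hit rate at a small k would suggest the OCR text or the similarity measure needs work before any tuning phase is worth sponsoring.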


1 Response to Pythonistas: Up for quick hack to test Dedup’ing 78rpm records using document clustering?

  1. art rhyno says:

    I am hardly a Pythonista but I think this is a promising approach. I put my tinkering here:
    https://github.com/artunit/identSim
    and the gensim steps are well described here:
    https://dev.to/thepylot/compare-documents-similarity-using-python-nlp-4odp
    I struggled more with the OCR bits. I haven’t worked a lot with materials that have the colour variations and text layouts found on record labels, and I found that the default Tesseract processing needed some boosting. But if the text is strong enough, I think this has great potential.
