I think this could be a one-day exploration, at least to figure out whether it will work, but it is beyond my Python ability.
Idea: OCR the labels of our 78rpm records, then take an image of a new 78rpm record, OCR it the same way, and list the existing records whose label text is closest to it. I would think this could be done with a search engine, or with document vectors (gensim). On a Mac yesterday I got the images and a lead from trusty Stack Exchange:
pip3 install internetarchive
brew install tesseract
# get the images of the labels (there are 350k of them, but can test with 1000)
ia search "collection:georgeblood" --itemlist | head -1000 | parallel -j10 'ia download {} --no-directories --format="Item Image"'
ls -1 *.jpg | parallel 'tesseract {} {}'
# then something like gensim, e.g. https://stackoverflow.com/questions/42781292/doc2vec-get-most-similar-documents
Anyone up for helping test this theory? Again, I am thinking this is a one-day hack. If it works, it will then need tuning and such, and the Archive, I would hope, could sponsor that phase.
We have many duplicates in the collection already, so testing this could be easy.
I am hardly a Pythonista but I think this is a promising approach. I put my tinkering here:
https://github.com/artunit/identSim
and the gensim steps are well described here:
https://dev.to/thepylot/compare-documents-similarity-using-python-nlp-4odp
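To give a feel for the similarity step without installing anything, here is a rough sketch using only the Python standard library. To be clear, this is not gensim and not necessarily what is in the repo above: it is plain TF-IDF with cosine similarity, standing in for the same idea. The function names (tokenize, build_index, most_similar) are made up for illustration, and the *.jpg.txt pattern assumes Tesseract was run as in the first post, so that each label image has a text file next to it.

```python
import math
import re
import sys
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase and keep runs of 3+ letters, which sheds a lot of OCR noise."""
    return re.findall(r"[a-z]{3,}", text.lower())


def weigh(counts, idf):
    """Turn raw term counts into TF-IDF weights, ignoring unseen terms."""
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def build_index(docs):
    """docs maps a name to its OCR text. Returns (idf, vectors)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    # Document frequency: iterating a Counter yields each term once per doc.
    df = Counter(t for c in counts.values() for t in c)
    n = len(counts)
    # Smoothed IDF so that very common label words still count a little.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {name: weigh(c, idf) for name, c in counts.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_text, idf, vectors, top_n=5):
    """Rank indexed labels by similarity to the text of a new label."""
    query = weigh(Counter(tokenize(query_text)), idf)
    scored = [(cosine(query, vec), name) for name, vec in vectors.items()]
    return sorted(scored, reverse=True)[:top_n]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 label_sim.py new_label.jpg.txt
    query_path = Path(sys.argv[1])
    docs = {
        p.name: p.read_text(errors="ignore")
        for p in Path(".").glob("*.jpg.txt")
        if p.resolve() != query_path.resolve()
    }
    idf, vectors = build_index(docs)
    text = query_path.read_text(errors="ignore")
    for score, name in most_similar(text, idf, vectors):
        print(f"{score:.3f}  {name}")
```

Since the collection already has known duplicates, running this on the OCR text of one copy and checking whether the other copy lands near the top of the list would be a quick sanity check. If it looks promising, swapping in gensim as described in the links should be a small change.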
I struggled more with the OCR bits. I haven’t worked much with materials that have the colour variations and text layouts found on record labels, and I found that the default Tesseract processing needed some boosting. But if the recognized text is strong enough, I think this has great potential.
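For anyone who wants to try the same kind of boosting, here is a minimal sketch of the sort of preprocessing I mean, using Pillow (pip3 install pillow). It is only an assumption about what helps, not a tuned recipe: the function name boost_for_ocr, the 2x scale and the threshold of 128 are all placeholders to experiment with.

```python
import sys

from PIL import Image, ImageOps


def boost_for_ocr(img, scale=2, threshold=128):
    """Return a larger, high-contrast black-and-white copy of a label image."""
    gray = ImageOps.grayscale(img)      # drop the label colour
    gray = ImageOps.autocontrast(gray)  # stretch faint text across the full range
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Hard threshold: every pixel becomes pure black or pure white.
    return big.point(lambda p: 255 if p > threshold else 0)


if __name__ == "__main__":
    # Writes e.g. label.jpg.ocr.png next to each input for Tesseract to read.
    for path in sys.argv[1:]:
        boost_for_ocr(Image.open(path)).save(path + ".ocr.png")
```

The saved .ocr.png files can then be fed to the same tesseract command as before. One caveat: many labels have light text on a dark background, so inverting the image first (ImageOps.invert on the greyscale copy) may be worth testing as well.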