(final?) yak shaving tool for writing my PhD thesis

Pipeline

  1. scan the cards

    Register them in a db, where each card is assigned an id (serial int), a filename, and a scannedat timestamp. There is no intent to use it in a distributed manner yet. We might modify the image to get a higher OCR success rate. I might use minio to store the images. A minimal schema sketch is below.
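
    A rough sketch of the registration step, using sqlite3 so it runs standalone (the prose says "serial int", which suggests Postgres; the table layout and function names here are my own placeholders):

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("cards.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cards (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- "serial int" in Postgres terms
            filename  TEXT NOT NULL,
            scannedat TEXT NOT NULL                       -- ISO-8601 scan time
        )
    """)

    def register_card(filename: str) -> int:
        """Insert one scanned image and return its new id."""
        now = datetime.now(timezone.utc).isoformat()
        cur = conn.execute(
            "INSERT INTO cards (filename, scannedat) VALUES (?, ?)",
            (filename, now),
        )
        conn.commit()
        return cur.lastrowid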

    Preparation command before feeding the images into OCR:

    mogrify -path processed \
      -rotate -90 \
      -density 300 \
      -colorspace Gray \
      -contrast-stretch 0 \
      -despeckle \
      *.jpg
    
  2. ocr the images using Azure

    The images will be sent to the Azure Read API to OCR the handwritten text. The result will be stored in a separate table.

curl -v -X POST \
  "https://umesata-vision.cognitiveservices.azure.com/vision/v3.2/read/analyze?language=ja" \
  -H "Ocp-Apim-Subscription-Key: ${AZURE_KEY}" \
  -H "Content-Type: application/octet-stream" \
  --data-binary "@image.jpg"
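
    The Read call is asynchronous: the POST above returns 202 Accepted with an Operation-Location header that is polled until the analysis finishes. A stdlib-only Python sketch of the same call plus the polling loop (endpoint and AZURE_KEY as in the curl above):

    import json, os, time, urllib.request

    ENDPOINT = "https://umesata-vision.cognitiveservices.azure.com"
    KEY = os.environ["AZURE_KEY"]

    def ocr_image(path: str) -> dict:
        # Submit the image; the result URL comes back in a response header.
        req = urllib.request.Request(
            f"{ENDPOINT}/vision/v3.2/read/analyze?language=ja",
            data=open(path, "rb").read(),
            headers={
                "Ocp-Apim-Subscription-Key": KEY,
                "Content-Type": "application/octet-stream",
            },
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            op_url = resp.headers["Operation-Location"]
        # Poll until Azure reports a terminal status.
        while True:
            poll = urllib.request.Request(
                op_url, headers={"Ocp-Apim-Subscription-Key": KEY}
            )
            with urllib.request.urlopen(poll) as resp:
                result = json.load(resp)
            if result["status"] in ("succeeded", "failed"):
                return result  # text lines: analyzeResult.readResults[*].lines[*].text
            time.sleep(1)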

  3. recreate a markdown document using an LLM

    Feed the OCR output to an LLM to convert it into a markdown document. The purpose of this pass is to eliminate OCR errors and reconstruct the content into a comprehensible chunk of data. I might edit this markdown to supplement information that was lost or needs more follow-up. Versions will be organized with a timestamp to store history. (In other words, there will be multiple versions under one id.)
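
    A sketch of the versioned storage, assuming a second table keyed by (card id, timestamp); the table and column names are placeholders of mine:

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("cards.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS card_markdown (
            card_id    INTEGER NOT NULL,  -- references cards.id
            created_at TEXT    NOT NULL,  -- version timestamp
            markdown   TEXT    NOT NULL,
            PRIMARY KEY (card_id, created_at)
        )
    """)

    def add_version(card_id: int, markdown: str) -> None:
        """Store one more version under the same card id."""
        now = datetime.now(timezone.utc).isoformat()
        conn.execute(
            "INSERT INTO card_markdown VALUES (?, ?, ?)", (card_id, now, markdown)
        )
        conn.commit()

    def latest_markdown(card_id: int) -> str | None:
        """The newest version is the one later pipeline steps consume."""
        row = conn.execute(
            "SELECT markdown FROM card_markdown WHERE card_id = ?"
            " ORDER BY created_at DESC LIMIT 1",
            (card_id,),
        ).fetchone()
        return row[0] if row else None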

  4. split it into sentences and compute an embedding per sentence. calculate the geometric mean to represent the whole card. only the latest version of the markdown document will be used for this (a pooling sketch follows below).
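
    A pooling sketch. One caveat: a component-wise geometric mean is only defined for strictly positive values, and embedding components are usually signed, so this sketch pools with the L2-normalized arithmetic mean instead; embed() is a hypothetical stand-in for the actual embedding model:

    import re
    import numpy as np

    def embed(sentence: str) -> np.ndarray:
        raise NotImplementedError  # hypothetical: call the real embedding model here

    def card_vector(markdown: str) -> np.ndarray:
        # Split on sentence-ending punctuation (Japanese 。 included).
        sentences = [s for s in re.split(r"(?<=[.。!?！？])\s*", markdown) if s.strip()]
        E = np.stack([embed(s) for s in sentences])
        # Arithmetic mean, L2-normalized; stand-in for the geometric mean (see above).
        v = E.mean(axis=0)
        return v / np.linalg.norm(v)
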
  5. cluster the cards (one possible approach is sketched below)
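
    The clustering algorithm isn't pinned down yet; as one possibility, k-means over the pooled card vectors (scikit-learn), with k chosen arbitrarily here:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_cards(vectors: np.ndarray, k: int = 10) -> np.ndarray:
        """vectors: (n_cards, dim); returns one cluster label per card."""
        return KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(vectors)
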
  6. visualize (the interface is by itself a whole project, so this comes later)

Date: 2025-02-08 Sat 15:01