Connection error: HTTPConnectionPool(...): Read timed out. #775

Closed
gotjoshua opened this issue Dec 13, 2019 · 13 comments
@gotjoshua

What's your browser and operating system?

Chrome 79 on macOS (High Sierra)

What is the problem?

Connection error: HTTPConnectionPool(host='webodm_node-odm_1', port=3000): Read timed out. (read timeout=30)

What should be the expected behavior?

WebODM can reconnect to a running NodeODM processing node

How can we reproduce this?

29GB for 5800 images, DEM+DTM settings. I allocated 2 CPUs and 48GB RAM to the nodeodm container, and the CPUs are running flat out. I can reach webodm_node-odm_1:3000 from the broker/redis and webapp command lines (if I bash into the containers and run curl http://webodm_node-odm_1:3000).

webapp is accessible via traefik; the rest of the containers are on a separate network.
docker-compose.yml:

version: "2.3"

networks:
  odm-net:
    external: false
  haven:
    external:
      name: haven

volumes:
  dbdata:
    driver: local-persist
    driver_opts:
      mountpoint: /mnt/raid0/srv/volumes/odm-db
  appmedia: #should be == ${WO_MEDIA_DIR}
    driver: local-persist
    driver_opts:
      mountpoint: /mnt/raid0/srv/volumes/odm-appmedia
services:
  db:
    image: opendronemap/webodm_db
    container_name: db
    ports:
      - "5432"
    volumes:
      - dbdata:/var/lib/postgresql/data
    restart: on-failure:10
    oom_score_adj: -100
    networks:
      - odm-net
  webapp:
    image: opendronemap/webodm_webapp
    container_name: webapp
    entrypoint: /bin/bash -c "chmod +x /webodm/*.sh && /bin/bash -c \"/webodm/wait-for-postgres.sh db /webodm/wait-for-it.sh -t 0 broker:6379 -- /webodm/start.sh\""
    volumes:
      - ${WO_MEDIA_DIR}:/webodm/app/media
    expose:
      - 8000
    labels:
      traefik.enable: true
      traefik.app.frontend.rule: Host:odm.internal.tld
      traefik.app.port: 8000 #internal port to be exposed on https://HOSTNAME
      traefik.docker.network: haven
    depends_on:
      - db
      - broker
      - worker
      - node-odm
    environment:
      - WO_PORT
      - WO_HOST
      - WO_DEBUG
      - WO_BROKER
      - WO_DEV
      - WO_DEFAULT_NODES
    restart: on-failure:10
    oom_score_adj: 0
    networks:
      - odm-net
      - haven
  broker:
    image: redis
    container_name: broker
    restart: on-failure:10
    oom_score_adj: -500
    networks:
      - odm-net
  node-odm:
    image: opendronemap/nodeodm
    container_name: webodm_node-odm_1
    mem_limit: 48g
    memswap_limit: 60g
    cpus: 2
    cpuset: 6,7
    volumes:
      - /V1/srv/odm/node1data:/var/www/data
      - /V1/srv/odm/node1tmp:/var/www/tmp
    restart: on-failure:10
    oom_score_adj: 500
    networks:
      - odm-net
  node-odm_2:
    image: opendronemap/nodeodm
    container_name: webodm_node-odm_2
    mem_limit: 48g
    memswap_limit: 60g
    cpus: 2
    cpuset: 5,6
    volumes:
      - /V1/srv/odm/node2data:/var/www/data
      - /V1/srv/odm/node2tmp:/var/www/tmp
    restart: unless-stopped
    oom_score_adj: 500
    networks:
      - odm-net
  worker:
    image: opendronemap/webodm_webapp
    container_name: worker
    entrypoint: /bin/bash -c "/webodm/wait-for-postgres.sh db /webodm/wait-for-it.sh -t 0 broker:6379 -- /webodm/wait-for-it.sh -t 0 webapp:8000 -- /webodm/worker.sh start"
    volumes:
      - ${WO_MEDIA_DIR}:/webodm/app/media
    depends_on:
      - db
      - broker
    environment:
      - WO_BROKER
      - WO_DEBUG
    restart: on-failure:10
    oom_score_adj: 250
    networks:
      - odm-net

Maybe the nodeodm container is too busy processing to answer webapp... but then I would also expect curl http://webodm_node-odm_1:3000 to fail.

Does WebODM continue to retry, or does it give up once it gets the read timeout?

Sorry if this is already known; I couldn't find anything about this error. It seems that the task is actually running, so I would expect WebODM to keep updating the running time to show that it is still alive and ongoing.

@pierotofy
Member

pierotofy commented Dec 14, 2019

We should probably try to display a better error message (somehow?), but the reason you got a timeout error is most likely that your processing node ran out of memory and the operating system started killing processes at random to reclaim it, which probably caused the timeout.

29GB of RAM is probably enough to process ~1000-1500 images, not 6500 (without split-merge; see https://docs.opendronemap.org/large.html)
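For reference, split-merge is driven by task options; a rough PyODM sketch (option names per the ODM documentation linked above, file names are placeholders):

from pyodm import Node

# Sketch: submit a split-merge task. "split" is the approximate number of
# images per submodel and "split-overlap" is in meters; see the docs above.
node = Node("webodm_node-odm_1", 3000)
task = node.create_task(
    ["IMG_0001.JPG", "IMG_0002.JPG"],  # placeholder image list
    {"split": 400, "split-overlap": 150},
)
print(task.info().status)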

@gotjoshua
Author

Thanks for the explanation and the pointer to the split option!

2 questions:

  • Will the system automatically split the task across multiple processing nodes?
  • How often does the UI retry to connect to the processing node? (I had the feeling that it was a stale error message)

Here is the JSON response for the processing list:

[
   {
      "id":"9e04ce81-503b-4b60-ac0d-3190ef704696",
      "project":1,
      "processing_node":1,
      "processing_node_name":"node-odm-1",
      "can_rerun_from":[
         "",
         "dataset",
         "split",
         "merge",
         "opensfm",
         "mve",
         "odm_filterpoints",
         "odm_meshing",
         "mvs_texturing",
         "odm_georeferencing",
         "odm_dem",
         "odm_orthophoto"
      ],
      "uuid":"",
      "name":"2018 Unified",
      "processing_time":-1,
      "auto_processing_node":false,
      "status":30,
      "last_error":"Connection error: HTTPConnectionPool(host='webodm_node-odm_1', port=3000): Read timed out. (read timeout=30)",
      "options":[
         {
            "name":"pc-classify",
            "value":true
         },
         {
            "name":"orthophoto-resolution",
            "value":"4"
         },
         {
            "name":"dtm",
            "value":true
         },
         {
            "name":"dem-resolution",
            "value":"10"
         },
         {
            "name":"dsm",
            "value":true
         },
         {
            "name":"verbose",
            "value":true
         }
      ],
      "available_assets":[

      ],
      "created_at":"2019-12-13T20:13:09.149805Z",
      "pending_action":null,
      "public":false,
      "resize_to":-1,
      "upload_progress":0.0,
      "resize_progress":0.0,
      "running_progress":0.0,
      "import_url":"",
      "images_count":5863,
      "partial":false
   }
]

@gotjoshua
Author

Hello again. I am still getting this timeout error, even though I can see that the task is still running in the NodeODM web interface (but not in the WebODM dashboard):

[Screenshots attached, taken 2020-04-21: the NodeODM web interface and the WebODM dashboard]

Any ideas why this is happening, or how to fix it?

@gotjoshua
Author

I have the feeling that once the task gets a timeout, it doesn't check properly anymore. Uploading always seems to work fine, and sometimes I am able to see the task output in the dashboard. But once it times out, it seems that even fully refreshing the page does not reconnect to the processing node.

The last job finished, and I needed to download the zip file from the processing node and then upload it into the dashboard.

@pierotofy
Member

Possibly related: #727

@tsmock
Contributor

tsmock commented Mar 5, 2021

I've done a little bit of digging, and I believe that the reason this happens is that we don't get a response containing a UUID (needed to track progress) back from the node (see PyODM/api.py#L268 for the start of the relevant code).

Potential solution:

  • Modify PyODM/api.py so that if an error occurs, we set a uuid attribute on the exception and then reraise it. In WebODM, we could then check whether we have a UUID, at which point we can continue on (see the sketch below).
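A minimal sketch of that idea, assuming the timeout happens on the commit step after NodeODM has already assigned a UUID (TaskCreationError and commit_task are hypothetical names, not PyODM's actual API):

import requests

# Hypothetical sketch of the proposed PyODM change, not the real implementation.
class TaskCreationError(Exception):
    def __init__(self, message, uuid=None):
        super().__init__(message)
        self.uuid = uuid  # carry the UUID (if known) so the caller can recover

def commit_task(base_url, uuid):
    """Commit an already-uploaded task; attach the UUID to any timeout error."""
    try:
        # NodeODM's commit endpoint: the call suspected of exceeding 30 seconds
        r = requests.post("{}/task/new/commit/{}".format(base_url, uuid), timeout=30)
        r.raise_for_status()
    except requests.exceptions.Timeout as e:
        # The task may still have started on the node: keep the UUID and reraise
        raise TaskCreationError(str(e), uuid=uuid)
    return uuid

WebODM could then catch TaskCreationError and, if e.uuid is set, record it and poll the node later instead of immediately marking the task as failed.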

I've made two PRs with the (potential) solution. I still need to verify it.

See OpenDroneMap/PyODM#17 and #966.

@tsmock
Contributor

tsmock commented Mar 6, 2021

Workaround:

$ docker exec -it db bash
# psql --username postgres --dbname webodm_dev
-- Find tasks whose UUID was lost (it may be `uuid IS NULL` instead of `uuid=''`):
SELECT id,project_id,name,uuid,status,pending_action,last_error FROM app_task WHERE uuid='';
-- Restore the UUID and mark the task as running (status 20 = RUNNING):
UPDATE app_task SET uuid='<UUID from worker>',status=20,last_error=NULL WHERE id='<id with missing uuid>';
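If the node's web UI isn't handy, the "<UUID from worker>" can also be recovered over NodeODM's HTTP API. A small sketch, assuming the node is reachable at webodm_node-odm_1:3000 as in the compose file above:

import requests

# Sketch: list the tasks NodeODM knows about and print enough detail
# to pick out the UUID needed for the UPDATE statement above.
base = "http://webodm_node-odm_1:3000"
for t in requests.get(base + "/task/list", timeout=10).json():
    info = requests.get("{}/task/{}/info".format(base, t["uuid"]), timeout=10).json()
    print(info.get("uuid"), info.get("name"), info.get("status"))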

@pierotofy
Member

pierotofy commented Mar 6, 2021

Hey @tsmock ✋ thanks for looking into this!

The idea of having create_task return a UUID in case of error is a bit... of a hack (I like it, but there might be a better way). The idea is that creating a task is either a pass or a fail. If it fails, there shouldn't be a UUID.

The case of this error seems that:

  1. Upload begins
  2. create_task returns a timeout error for a network call somewhere (either /task/new/upload or, based on the observations, much more likely /task/new/commit, which exceeds the 30-second limit)
  3. WebODM records the timeout error (assumes a fail), but eventually the request completes in NodeODM and the task happily starts, without WebODM knowing.

To me this points to something we should fix in NodeODM (https://github.com/OpenDroneMap/NodeODM/blob/master/libs/taskNew.js#L227): createTask could take a bit of time, especially if there are a lot of images or the hard drive is slow (or for some other reason).

Another (simpler) option could be to increase the timeout for the call to /task/new/commit in PyODM.
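For what it's worth, PyODM's Node constructor already exposes that timeout (the read timeout=30 in the error above is its default), so a client-side bump would look roughly like this:

from pyodm import Node

# Sketch: raise PyODM's network timeout from the 30-second default.
# The 300 here is illustrative, not a recommended value.
node = Node("webodm_node-odm_1", 3000, timeout=300)
print(node.info())  # API calls now wait up to 300s before timing out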

I think a simple way to test the hypothesis that this is indeed a problem with /task/new/commit would be to modify NodeODM's createTask function to have a long setTimeout (longer than 30 seconds). You can launch NodeODM in test mode with:

node index.js --test

From NodeODM's directory.

Let me know what you find!

@gotjoshua
Author

gotjoshua commented Mar 6, 2021 via email

@tsmock
Contributor

tsmock commented Mar 6, 2021

> Hey @tsmock thanks for looking into this!

I looked into it because it affected me. :)

> The idea of having create_task return a UUID in case of error is a bit... of a hack (I like it, but there might be a better way). The idea is that creating a task is either a pass or a fail. If it fails, there shouldn't be a UUID.

I agree that it is a bit of a hack. However, it does (at least) put the UUID in the db so that we could (potentially) recheck every so often.

I was originally going to suggest increasing the timeout, but the issue is that no matter what the timeout is, the call can still time out while the task is successfully created.

My other solution would have been to add a callback of some sort, such that the node can tell the caller "hey, I'm done now!"
I didn't go that route partly due to time constraints (some of my other personal projects have much harder deadlines than this).

So I went with the "UUID in error" route just so that we could record the UUID in the database, and take further action later (e.g., check and see if it is actually running on the node).

Anyway, I probably won't be able to test until next weekend (I've got a run going right now, and it is at the 21-hour mark, so I really don't want to accidentally screw it up). Sometime I'll have to look into reusing older pipelines with additional images (it is still matching images together, and it's a project where I'm adding new imagery every week or so).

@pierotofy
Member

> I was originally going to suggest increasing the timeout, but the issue is that no matter what the timeout is, it can still timeout but successfully create a task.

You're perfectly right, not an ideal solution.

I think NodeODM should immediately return a UUID after a call to /task/new/commit and do its pre-process tasks in the background (reporting any errors as a "task failed" state change) instead of doing them as part of the call. That would make the call almost immediate and remove the need for a timeout (as well as not requiring any changes to either WebODM or PyODM).
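Sketched in Python purely to illustrate the pattern (NodeODM itself is Node.js, and every name here is hypothetical):

import threading
import uuid as uuidlib

TASKS = {}  # task uuid -> "QUEUED" | "RUNNING" | "FAILED"

def prepare_images(images):
    pass  # stand-in for the slow pre-processing suspected of causing the timeouts

def commit_task(images):
    """Handle the commit: respond with a UUID at once, do the work in the background."""
    task_uuid = str(uuidlib.uuid4())
    TASKS[task_uuid] = "QUEUED"

    def preprocess_then_start():
        try:
            prepare_images(images)
            TASKS[task_uuid] = "RUNNING"
        except Exception:
            TASKS[task_uuid] = "FAILED"  # errors surface as a state change, not an HTTP error

    threading.Thread(target=preprocess_then_start, daemon=True).start()
    return task_uuid  # the HTTP call returns almost immediately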

All of this assuming the call to /task/new/commit is indeed the cause of the issue (haven't verified this).

@pierotofy
Member

Sorry for the late update; this should be fixed / significantly improved with the changes in OpenDroneMap/NodeODM@18a714b, which will be merged into NodeODM soon. 🥂

@pierotofy
Member

Changes are in. Please update and see if the issue persists.

If it does, feel free to re-open this. 🙏
