Connection error: HTTPConnectionPool(...): Read timed out. #775

Closed
gotjoshua opened this issue Dec 13, 2019 · 13 comments
@gotjoshua

What's your browser and operating system?

Chrome 79 on macOS (High Sierra)

What is the problem?

Connection error: HTTPConnectionPool(host='webodm_node-odm_1', port=3000): Read timed out. (read timeout=30)

What should be the expected behavior?

WebODM can reconnect to a running NodeODM processing node

How can we reproduce this?

29GB for 5800 images, DEM+DTM settings. I allocated 2 CPUs and 48GB RAM to the nodeodm container, and the CPUs are running flat out. I can reach webodm_node-odm_1:3000 from the broker/redis and webapp command lines (if I bash into the containers and run curl http://webodm_node-odm_1:3000).

webapp is accessible via traefik; the rest of the containers are on a separate network.
docker-compose.yml:

version: "2.3"

networks:
  odm-net:
    external: false
  haven:
    external:
      name: haven

volumes:
  dbdata:
    driver: local-persist
    driver_opts:
      mountpoint: /mnt/raid0/srv/volumes/odm-db
  appmedia: #should be == ${WO_MEDIA_DIR}
    driver: local-persist
    driver_opts:
      mountpoint: /mnt/raid0/srv/volumes/odm-appmedia
services:
  db:
    image: opendronemap/webodm_db
    container_name: db
    ports:
      - "5432"
    volumes:
      - dbdata:/var/lib/postgresql/data
    restart: on-failure:10
    oom_score_adj: -100
    networks:
      - odm-net
  webapp:
    image: opendronemap/webodm_webapp
    container_name: webapp
    entrypoint: /bin/bash -c "chmod +x /webodm/*.sh && /bin/bash -c \"/webodm/wait-for-postgres.sh db /webodm/wait-for-it.sh -t 0 broker:6379 -- /webodm/start.sh\""
    volumes:
      - ${WO_MEDIA_DIR}:/webodm/app/media
    expose:
      - 8000
    labels:
      traefik.enable: true
      traefik.app.frontend.rule: Host:odm.internal.tld
      traefik.app.port: 8000 #internal port to be exposed on https://HOSTNAME
      traefik.docker.network: haven
    depends_on:
      - db
      - broker
      - worker
      - node-odm
    environment:
      - WO_PORT
      - WO_HOST
      - WO_DEBUG
      - WO_BROKER
      - WO_DEV
      - WO_DEFAULT_NODES
    restart: on-failure:10
    oom_score_adj: 0
    networks:
      - odm-net
      - haven
  broker:
    image: redis
    container_name: broker
    restart: on-failure:10
    oom_score_adj: -500
    networks:
      - odm-net
  node-odm:
    image: opendronemap/nodeodm
    container_name: webodm_node-odm_1
    mem_limit: 48g
    memswap_limit: 60g
    cpus: 2
    cpuset: 6,7
    volumes:
      - /V1/srv/odm/node1data:/var/www/data
      - /V1/srv/odm/node1tmp:/var/www/tmp
    restart: on-failure:10
    oom_score_adj: 500
    networks:
      - odm-net
  node-odm_2:
    image: opendronemap/nodeodm
    container_name: webodm_node-odm_2
    mem_limit: 48g
    memswap_limit: 60g
    cpus: 2
    cpuset: 5,6
    volumes:
      - /V1/srv/odm/node2data:/var/www/data
      - /V1/srv/odm/node2tmp:/var/www/tmp
    restart: unless-stopped
    oom_score_adj: 500
    networks:
      - odm-net
  worker:
    image: opendronemap/webodm_webapp
    container_name: worker
    entrypoint: /bin/bash -c "/webodm/wait-for-postgres.sh db /webodm/wait-for-it.sh -t 0 broker:6379 -- /webodm/wait-for-it.sh -t 0 webapp:8000 -- /webodm/worker.sh start"
    volumes:
      - ${WO_MEDIA_DIR}:/webodm/app/media
    depends_on:
      - db
      - broker
    environment:
      - WO_BROKER
      - WO_DEBUG
    restart: on-failure:10
    oom_score_adj: 250
    networks:
      - odm-net

Maybe the nodeodm container is too busy processing to answer webapp... but then I would also expect curl http://webodm_node-odm_1:3000 to fail.

Does WebODM continue to retry, or does it give up once it gets the read timeout?

Sorry if this is already known; I couldn't find anything about this error. It seems that the task is actually running, so I would expect WebODM to keep updating the running time to show that it is still alive and ongoing.

@pierotofy
Member

pierotofy commented Dec 14, 2019

We should probably try to display a better error message (somehow?), but the reason you got a timeout error is most likely that your processing node ran out of memory and the operating system started killing processes at random to reclaim it, which probably caused the timeout.

29GB of RAM is probably enough to process ~1000-1500 images, not 6500 (without split-merge; see https://docs.opendronemap.org/large.html)
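For reference, split-merge is driven by task options; a rough PyODM sketch (option names per the ODM documentation linked above, file names are placeholders):

from pyodm import Node

# Sketch: submit a split-merge task. "split" is the approximate number of
# images per submodel and "split-overlap" is in meters; see the docs above.
node = Node("webodm_node-odm_1", 3000)
task = node.create_task(
    ["IMG_0001.JPG", "IMG_0002.JPG"],  # placeholder image list
    {"split": 400, "split-overlap": 150},
)
print(task.info().status)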

@gotjoshua
Author

Thanks for the explanation and the pointer to the split option!

2 questions:

  • Will the system automatically split the task across multiple processing nodes?
  • How often does the UI retry to connect to the processing node? (I had the feeling that it was a stale error message)

Here is the JSON response for the processing list:

[
   {
      "id":"9e04ce81-503b-4b60-ac0d-3190ef704696",
      "project":1,
      "processing_node":1,
      "processing_node_name":"node-odm-1",
      "can_rerun_from":[
         "",
         "dataset",
         "split",
         "merge",
         "opensfm",
         "mve",
         "odm_filterpoints",
         "odm_meshing",
         "mvs_texturing",
         "odm_georeferencing",
         "odm_dem",
         "odm_orthophoto"
      ],
      "uuid":"",
      "name":"2018 Unified",
      "processing_time":-1,
      "auto_processing_node":false,
      "status":30,
      "last_error":"Connection error: HTTPConnectionPool(host='webodm_node-odm_1', port=3000): Read timed out. (read timeout=30)",
      "options":[
         {
            "name":"pc-classify",
            "value":true
         },
         {
            "name":"orthophoto-resolution",
            "value":"4"
         },
         {
            "name":"dtm",
            "value":true
         },
         {
            "name":"dem-resolution",
            "value":"10"
         },
         {
            "name":"dsm",
            "value":true
         },
         {
            "name":"verbose",
            "value":true
         }
      ],
      "available_assets":[

      ],
      "created_at":"2019-12-13T20:13:09.149805Z",
      "pending_action":null,
      "public":false,
      "resize_to":-1,
      "upload_progress":0.0,
      "resize_progress":0.0,
      "running_progress":0.0,
      "import_url":"",
      "images_count":5863,
      "partial":false
   }
]

@gotjoshua
Author

Hello again. I am still getting this timeout error, even though I can see that the task is still running in the NodeODM web interface (but not in the WebODM dashboard):

[Screenshots attached, taken 2020-04-21: the NodeODM web interface and the WebODM dashboard]

Any ideas why this is happening, or how to fix it?

@gotjoshua
Author

I have the feeling that once the task gets a timeout, it doesn't check properly anymore. Uploading always seems to work fine, and sometimes I am able to see the task output in the dashboard. But once it times out, it seems that even fully refreshing the page does not reconnect to the processing node.

The last job finished, and I needed to download the zip file from the processing node and then upload it into the dashboard.

@pierotofy
Member

Possibly related: #727

@tsmock
Contributor

tsmock commented Mar 5, 2021

I've done a little bit of digging, and I believe that the reason this happens is that we don't get a response containing a UUID (needed to track progress) back from the node (see PyODM/api.py#L268 for the start of the relevant code).

Potential solution:

  • Modify PyODM/api.py so that if an error occurs, we set a uuid attribute on the exception and then reraise it. In WebODM, we could then check whether we have a UUID, at which point we can continue on (see the sketch below).
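A minimal sketch of that idea, assuming the timeout happens on the commit step after NodeODM has already assigned a UUID (TaskCreationError and commit_task are hypothetical names, not PyODM's actual API):

import requests

# Hypothetical sketch of the proposed PyODM change, not the real implementation.
class TaskCreationError(Exception):
    def __init__(self, message, uuid=None):
        super().__init__(message)
        self.uuid = uuid  # carry the UUID (if known) so the caller can recover

def commit_task(base_url, uuid):
    """Commit an already-uploaded task; attach the UUID to any timeout error."""
    try:
        # NodeODM's commit endpoint: the call suspected of exceeding 30 seconds
        r = requests.post("{}/task/new/commit/{}".format(base_url, uuid), timeout=30)
        r.raise_for_status()
    except requests.exceptions.Timeout as e:
        # The task may still have started on the node: keep the UUID and reraise
        raise TaskCreationError(str(e), uuid=uuid)
    return uuid

WebODM could then catch TaskCreationError and, if e.uuid is set, record it and poll the node later instead of immediately marking the task as failed.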

I've made two PRs with the (potential) solution. I still need to verify it.

See OpenDroneMap/PyODM#17 and #966.

@tsmock
Contributor

tsmock commented Mar 6, 2021

Workaround:

$ docker exec -it db bash
# psql --username postgres --dbname webodm_dev
-- Find tasks whose UUID was lost (it may be `uuid IS NULL` instead of `uuid=''`):
SELECT id,project_id,name,uuid,status,pending_action,last_error FROM app_task WHERE uuid='';
-- Restore the UUID and mark the task as running (status 20 = RUNNING):
UPDATE app_task SET uuid='<UUID from worker>',status=20,last_error=NULL WHERE id='<id with missing uuid>';
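If the node's web UI isn't handy, the "<UUID from worker>" can also be recovered over NodeODM's HTTP API. A small sketch, assuming the node is reachable at webodm_node-odm_1:3000 as in the compose file above:

import requests

# Sketch: list the tasks NodeODM knows about and print enough detail
# to pick out the UUID needed for the UPDATE statement above.
base = "http://webodm_node-odm_1:3000"
for t in requests.get(base + "/task/list", timeout=10).json():
    info = requests.get("{}/task/{}/info".format(base, t["uuid"]), timeout=10).json()
    print(info.get("uuid"), info.get("name"), info.get("status"))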

@pierotofy
Member

pierotofy commented Mar 6, 2021

Hey @tsmock ✋ thanks for looking into this!

The idea of having create_task return a UUID in case of error is a bit... of a hack (I like it, but there might be a better way). The idea is that creating a task is either a pass or a fail. If it fails, there shouldn't be a UUID.

The case of this error seems that:

  1. Upload begins
  2. create_task returns a timeout error for a network call somewhere (either /task/new/upload or, based on the observations, much more likely /task/new/commit, which exceeds the 30-second limit)
  3. WebODM records the timeout error (assumes a fail), but eventually the request completes in NodeODM and the task happily starts, without WebODM knowing.

To me this points to something we should fix in NodeODM (https://github.com/OpenDroneMap/NodeODM/blob/master/libs/taskNew.js#L227): createTask could take a bit of time, especially if there are a lot of images or the hard drive is slow (or for some other reason).

Another (simpler) option could be to increase the timeout for the call to /task/new/commit in PyODM.
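For what it's worth, PyODM's Node constructor already exposes that timeout (the read timeout=30 in the error above is its default), so a client-side bump would look roughly like this:

from pyodm import Node

# Sketch: raise PyODM's network timeout from the 30-second default.
# The 300 here is illustrative, not a recommended value.
node = Node("webodm_node-odm_1", 3000, timeout=300)
print(node.info())  # API calls now wait up to 300s before timing out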

I think a simple way to test the hypothesis that this is indeed a problem with /task/new/commit would be to modify NodeODM's createTask function to have a long setTimeout (longer than 30 seconds). You can launch NodeODM in test mode with:

node index.js --test

From NodeODM's directory.

Let me know what you find!

@gotjoshua
Author

gotjoshua commented Mar 6, 2021 via email

@tsmock
Contributor

tsmock commented Mar 6, 2021

> Hey @tsmock thanks for looking into this!

I looked into it because it affected me. :)

> The idea of having create_task return a UUID in case of error is a bit... of a hack (I like it, but there might be a better way). The idea is that creating a task is either a pass or a fail. If it fails, there shouldn't be a UUID.

I agree that it is a bit of a hack. However, it does (at least) put the UUID in the db so that we could (potentially) recheck every so often.

I was originally going to suggest increasing the timeout, but the issue is that no matter what the timeout is, the call can still time out while the task is successfully created.

My other solution would have been to add a callback of some sort, such that the node can tell the caller "hey, I'm done now!"
I didn't go that route partly due to time constraints (some of my other personal projects have much harder deadlines than this).

So I went with the "UUID in error" route just so that we could record the UUID in the database, and take further action later (e.g., check and see if it is actually running on the node).

Anyway, I probably won't be able to test until next weekend (I've got a run going right now, and it is at the 21-hour mark, so I really don't want to accidentally screw it up). Sometime I'll have to look into reusing older pipelines with additional images (it is still matching images together, and it's a project where I'm adding new imagery every week or so).

@pierotofy
Member

> I was originally going to suggest increasing the timeout, but the issue is that no matter what the timeout is, it can still timeout but successfully create a task.

You're perfectly right, not an ideal solution.

I think NodeODM should immediately return a UUID after a call to /task/new/commit and do its pre-process tasks in the background (reporting any errors as a "task failed" state change) instead of doing them as part of the call. That would make the call almost immediate and remove the need for a timeout (as well as not requiring any changes to either WebODM or PyODM).
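Sketched in Python purely to illustrate the pattern (NodeODM itself is Node.js, and every name here is hypothetical):

import threading
import uuid as uuidlib

TASKS = {}  # task uuid -> "QUEUED" | "RUNNING" | "FAILED"

def prepare_images(images):
    pass  # stand-in for the slow pre-processing suspected of causing the timeouts

def commit_task(images):
    """Handle the commit: respond with a UUID at once, do the work in the background."""
    task_uuid = str(uuidlib.uuid4())
    TASKS[task_uuid] = "QUEUED"

    def preprocess_then_start():
        try:
            prepare_images(images)
            TASKS[task_uuid] = "RUNNING"
        except Exception:
            TASKS[task_uuid] = "FAILED"  # errors surface as a state change, not an HTTP error

    threading.Thread(target=preprocess_then_start, daemon=True).start()
    return task_uuid  # the HTTP call returns almost immediately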

All of this assuming the call to /task/new/commit is indeed the cause of the issue (haven't verified this).

@pierotofy
Member

Sorry for the late update; this should be fixed / significantly improved with the changes in OpenDroneMap/NodeODM@18a714b, which will be merged into NodeODM soon. 🥂

@pierotofy
Member

Changes are in. Please update and see if the issue persists.

If it does, feel free to re-open this. 🙏
