Describe the bug
I am encountering a ConnectionRefusedError while running an OpenSearch indexing script. Initially, the script worked without issues, indexing approximately 200 files, each containing around 30,000 documents.
However, now it fails with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/connection/http_urllib3.py", line 264, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 445, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 616, in connect
    self.sock = sock = self._new_conn()
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb798541430>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "indexing.py", line 162, in <module>
    create_index(index_name, path, model, batch_size=100)
  File "indexing.py", line 139, in create_index
    client.bulk(body=batch, refresh=True)
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/client/utils.py", line 181, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/client/__init__.py", line 462, in bulk
    return self.transport.perform_request(
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/transport.py", line 446, in perform_request
    raise e
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/transport.py", line 409, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/connection/http_urllib3.py", line 279, in perform_request
    raise ConnectionError("N/A", str(e), e)
opensearchpy.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fb798541430>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7fb798541430>: Failed to establish a new connection: [Errno 111] Connection refused)
When encountering this error, I restart OpenSearch and resume indexing from the exact file where the process stopped. However, upon resuming, the same error occurs again.
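For context, a restarted node can take a while before it accepts connections again, so I wait for the cluster to come back before resuming. A minimal sketch, assuming the client object from the script below (ping() returns False rather than raising when the node is unreachable):

import time

def wait_for_cluster(client, attempts=30, delay=10):
    # Poll until OpenSearch accepts connections again, or give up.
    for attempt in range(attempts):
        if client.ping():
            return True
        print(f"Cluster unreachable (attempt {attempt + 1}/{attempts}); retrying in {delay}s")
        time.sleep(delay)
    return False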
Related component
Clients
To Reproduce
The script reads JSON files from a directory and indexes them into an OpenSearch cluster. It uses the opensearchpy library to interact with OpenSearch and the sentence-transformers library for document embeddings. This is the code:
import pandas as pd
import os
import json
import torch
import time
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

path = "../outputNJSONextracted"  # Directory containing your JSON files
model_card = 'sentence-transformers/msmarco-distilbert-base-tas-b'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device {device}")

host = '127.0.0.1'
# host = '54.93.99.186'
port = 9200
auth = ('admin', 'IVIngi2024!')  # ('admin', 'admin')

client = OpenSearch(
    hosts=[{'host': host, 'port': port}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    timeout=30,
    max_retries=10
)
print("Connection opened...")

index_name = 'medline-faiss-hnsw-3'  # change the index name
index_body = {
    "settings": {
        "index": {
            "knn": "true",
            "refresh_interval": -1,
            # "default_pipeline": "medline-ingest-pipeline",  # embedding is computed in the script
            "number_of_shards": 5,
            "number_of_replicas": 0
        }
    },
    "mappings": {
        "properties": {
            "embedding_abstract": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "engine": "faiss",
                    "name": "hnsw",
                    "space_type": "innerproduct"
                }
            },
            "title": {"type": "text"},
            "abstract": {"type": "text"},
            "pmid": {"type": "keyword"},
            "journal": {"type": "text"},
            "pubdate": {"type": "date"},
            "authors": {"type": "text"}
        }
    }
}

response = client.indices.create(index_name, body=index_body)
print(response)


def create_index(index_name, directory_path, model, batch_size=100):
    j = 0
    documents = set()
    files_number = 0
    for filename in sorted(os.listdir(directory_path)):
        start_time = time.time()
        if filename.endswith(".json"):
            print(f"Starting indexing {filename} ...")
            # Construct the full file path
            file_path = os.path.join(directory_path, filename)
            # Read the file line by line (NDJSON: one JSON document per line)
            with open(file_path, 'r') as file:
                dictionaries = []
                for line in file:
                    dictionaries.append(json.loads(line))
            # Create a DataFrame and keep only the required columns
            df = pd.DataFrame(dictionaries)
            df = df[['pmid', 'title', 'abstract', 'journal', 'authors', 'pubdate']]
            batch = []  # accumulates bulk actions for this file
            for i, row in df.iterrows():
                pmid = row["pmid"]
                if pmid in documents:  # skip duplicates across files
                    continue
                documents.add(pmid)
                embedding = model.encode(row["abstract"])
                doc = {
                    "pmid": pmid,
                    "abstract": row["abstract"],
                    "title": row["title"],
                    "authors": row['authors'],
                    "journal": row['journal'],
                    "pubdate": row['pubdate'],
                    "embedding_abstract": embedding
                }
                batch.append({"index": {"_index": index_name, "_id": pmid}})
                batch.append(doc)
                j += 1
                if len(batch) >= batch_size * 2:  # action line + source line per document
                    client.bulk(body=batch, refresh=True)
                    batch = []
            if batch:
                client.bulk(body=batch, refresh=True)
                print("Indexed remaining documents")
            files_number += 1
            print(f"Processed file: {filename} in {time.time() - start_time}")
            print("Number of documents currently indexed ", j)
            if files_number % 100 == 0:
                print("-" * 50)
                print(f"Files indexed = {files_number}")
                print()
    print("Total documents inserted = ", j)


model = SentenceTransformer(model_card)
model.to(device)
print("Creating indexing...")
start = time.time()
create_index(index_name, path, model, batch_size=100)
print(f"Time needed {time.time() - start}")
Troubleshooting Steps Taken:
Increased the heap size in the jvm.options file to 8 GB (the maximum I can allocate on this machine).
Attempted to reduce batch size to mitigate potential HTTP request size issues.
Despite these efforts, the script continues to encounter connection issues.
I suspect that the problem may be related to sending large HTTP requests or some configuration issue with the OpenSearch server. However, I'm unsure how to proceed in diagnosing and resolving the issue.
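To test the request-size theory, the serialized bulk body could be measured before each send. A rough sketch, assuming the embeddings are converted to plain lists (e.g. embedding.tolist()) so json.dumps can handle them:

import json

def bulk_body_bytes(batch):
    # The bulk API body is NDJSON: one JSON object per line plus a newline.
    return sum(len(json.dumps(item).encode("utf-8")) + 1 for item in batch)

# Before each client.bulk(body=batch, refresh=True):
# print(f"Bulk body: {bulk_body_bytes(batch) / (1024 * 1024):.1f} MiB")

For what it's worth, an oversized body should normally be rejected with an HTTP 413 (the default http.max_content_length is 100 MB) rather than a refused TCP connection, so the server-side logs around the time of failure may be more telling.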
Any insights or suggestions would be greatly appreciated. Thank you!
Expected behavior
Normal connection: indexing completes without the connection being refused.
[Triage - attendees 123] @lijie123bes Thanks for creating this issue with full reproduction steps, moving this to the python client repo for further investigation.