microsoft · adhazel · Jan 16, 2025
@@ -0,0 +1,4 @@
+{
+  "type": "patch",
+  "description": "Adding escape and quote characters to the pandas read_csv logic used by the csv file loader."
+}
@@ -19,7 +19,7 @@ If the embedding target is `all`, and you want to only embed a subset of these f
 
 ## Input Data
 
-Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with `GRAPHRAG_INPUT_` below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a `text` field (which can be mapped with environment variables), but it's helpful if they also have `title`, `timestamp`, and `source` fields. Additional fields can be included as well, which will land as extra fields on the `Document` table.
+Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with `GRAPHRAG_INPUT_` below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a `text` field (which can be mapped with environment variables), but it's helpful if it also has `title`, `timestamp`, and `source` fields. Additional fields can be included as well, which will land as extra fields on the `Document` table. When reading CSV files, the pipeline assumes that the escape character is a backslash (`\`) and that the quote character is a double quote (`"`).
 
 ## Base LLM Settings
 

@@ -35,7 +35,12 @@ async def load_file(path: str, group: dict | None) -> pd.DataFrame:
         if group is None:
             group = {}
         buffer = BytesIO(await storage.get(path, as_bytes=True))
-        data = pd.read_csv(buffer, encoding=config.encoding or "latin-1")
+        data = pd.read_csv(
+            buffer,
+            encoding=config.encoding or "latin-1",
+            escapechar="\\",
+            quotechar='"',
+        )
         additional_keys = group.keys()
         if len(additional_keys) > 0:
             data[[*additional_keys]] = data.apply(

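To illustrate what the two new `read_csv` arguments change, here is a minimal sketch that parses a small in-memory CSV the same way the patched loader does. The sample row and the `read_sample` helper are hypothetical and exist only for this illustration; the `latin-1` encoding mirrors the loader's fallback.

```python
from io import BytesIO

import pandas as pd

# Hypothetical sample input: the quoted `text` field contains both a comma
# and backslash-escaped double quotes.
raw = b'text,title\n"She said \\"hello\\", then left",Greeting\n'


def read_sample(data: bytes) -> pd.DataFrame:
    """Parse CSV bytes with the same options the patched loader passes."""
    return pd.read_csv(
        BytesIO(data),
        encoding="latin-1",
        escapechar="\\",  # a backslash escapes the character that follows it
        quotechar='"',    # fields may be wrapped in double quotes
    )


df = read_sample(raw)
print(df.loc[0, "text"])
```

With `escapechar` set, the escaped quotes are kept as literal characters inside the `text` value and the embedded comma does not split the field, so the row still parses into the two header columns.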
@@ -90,6 +90,7 @@ tenacity = "^9.0.0"
 json-repair = "^0.30.3"
 tqdm = "^4.67.1"
 httpx = "^0.28.1"
+semversioner = "^2.0.5"
 
 [tool.poetry.group.dev.dependencies]
 coverage = "^7.6.9"