Commit

nlp edits
simon-mo committed Mar 4, 2024
1 parent 97dd1b0 commit 068dcbf
Showing 1 changed file with 4 additions and 155 deletions.
159 changes: 4 additions & 155 deletions gradproject.md
@@ -165,7 +165,7 @@ You can find the processed dataset in `./chatbot-arena-conversations.jsonl.gz`.
- Removed conversations with more than one round.
- Removed conversations classified as toxic or harmful.

The dataset you will be working with contains `25322` rows (out of `33000` total rows) and `7` columns. The columns are:
The dataset you will be working with contains `25322` rows (out of `33000` total rows) and `7` columns ([example row](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdfe719e6e#file-example-row-json)). The columns are:

- `question_id`: A unique identifier for the question.
- `model_a`: The name of the first chatbot model.
@@ -175,170 +175,19 @@ The dataset you will be working with contains `25322` total
- `conversation_a`: The conversation between the user and `model_a`.
- `conversation_b`: The conversation between the user and `model_b`.
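
As a quick sanity check, here is a minimal loading sketch for the dataset described above. It assumes `pandas` is available and the file is in the working directory; the `arena` variable name is ours, not part of the starter code.

```python
import pandas as pd

# jsonl.gz = one JSON object per line, gzip-compressed; pandas handles both directly.
arena = pd.read_json("./chatbot-arena-conversations.jsonl.gz", lines=True)

print(arena.shape)                   # expected: (25322, 7)
print(arena.columns.tolist())        # the seven columns listed above
print(arena["winner"].value_counts())
```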

<!-- collapse -->
<details>
<summary>An example of the data is as follows:</summary>
```json
{
  "question_id": "c0fc42c6f5f14f2aa5a89f71f8553730",
  "model_a": "vicuna-13b",
  "model_b": "koala-13b",
  "winner": "model_a",
  "judge": "arena_user_973",
  "conversation_a": [
    {
      "content": "Count from 1 to 10 with step = 3",
      "role": "user"
    },
    {
      "content": "1, 4, 7, 10\n\nCounting with a step of 3 means that you will increase the number by 3 each time. So, the first number is 1, the next is 4 (1 + 3), the next is 7 (4 + 3), and so on.",
      "role": "assistant"
    }
  ],
  "conversation_b": [
    {
      "content": "Count from 1 to 10 with step = 3",
      "role": "user"
    },
    {
      "content": "1, 4, 7, 10",
      "role": "assistant"
    }
  ]
}
```
</details>
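
Because every conversation has exactly one round after preprocessing, the user's question and each model's answer can be pulled out with simple indexing. A minimal sketch, reusing the `arena` DataFrame from the loading sketch above:

```python
# The first message is the user's prompt; the second is the assistant's reply.
arena["prompt"] = arena["conversation_a"].apply(lambda conv: conv[0]["content"])
arena["response_a"] = arena["conversation_a"].apply(lambda conv: conv[1]["content"])
arena["response_b"] = arena["conversation_b"].apply(lambda conv: conv[1]["content"])
```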

There are two auxiliary datasets that you can use to help with your analysis (a short loading sketch follows the list below):

- `./chatbot-arena-prompts-embeddings.npy` contains the 256-dimensional text embedding for each human question. The embeddings were generated using OpenAI's `text-embedding` model. We will explain what embeddings are and how you can use them later in this notebook. The array has shape `(25322, 256)` and dtype `float32`.
- `./chatbot-arena-gpt3-scores.jsonl.gz` contains labels for the dataset you can use for later modeling tasks. It has the following fields:
- `./chatbot-arena-gpt3-scores.jsonl.gz` ([example row](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdfe719e6e#file-example-aux-row-json)) contains labels for the dataset that you can use for later modeling tasks. It has the following fields:
- `question_id`: The unique identifier for the question, as seen in `./chatbot-arena-conversations.jsonl.gz`.
- `prompt`: The extracted human question. This is equivalent to the first message in `conversation_a` and `conversation_b` in `./chatbot-arena-conversations.jsonl.gz`.
- `openai_scores_raw_choices_nested`: The raw response from OpenAI's GPT-3.5 model (see later for the prompt). It contains the inferred topic, the reasoning behind a hardness score from 1 to 10, and the score value. For each prompt, we have 3 responses. We extracted the fields into the following columns.
- `topic_modeling_1`, `topic_modeling_2`, `topic_modeling_3`: The topic labels for the first, second, and third responses. Each topic is described by two words.
- `score_reason_1`, `score_reason_2`, `score_reason_3`: The reason for the hardness score for the first, second, and third response.
- `score_value_1`, `score_value_2`, `score_value_3`: The hardness score for the first, second, and third response.
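
A minimal sketch for loading both auxiliary files, reusing the `arena` DataFrame from above. We assume the embedding rows are stored in the same order as the conversations file; the score labels are safer to align by joining on `question_id`.

```python
import numpy as np
import pandas as pd

# 256-dimensional embedding per prompt, assumed row-aligned with the conversations file.
embeddings = np.load("./chatbot-arena-prompts-embeddings.npy")
print(embeddings.shape, embeddings.dtype)   # expected: (25322, 256) float32

# GPT-3.5 hardness labels; join on question_id rather than relying on row order.
scores = pd.read_json("./chatbot-arena-gpt3-scores.jsonl.gz", lines=True)
labeled = arena.merge(scores, on="question_id", how="left")
```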

<details>
<summary>An example of the data is as follows:</summary>

```json
{
  "question_id": "58210e39b3fd4441a2bd4a518bb44c2d",
  "prompt": "What is the difference between OpenCL and CUDA?",
  "openai_scores_raw_choices_nested": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "{\n \"topic_modeling\": \"Technical Comparison\",\n \"score_reason\": \"This prompt requires the AI to accurately compare and contrast two distinct technologies, OpenCL and CUDA. It assesses the AI's factual accuracy and knowledge of these technologies, as well as its ability to articulate the differences between them.\",\n \"score_value\": 9\n}",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    },
    {
      "finish_reason": "stop",
      "index": 1,
      "logprobs": null,
      "message": {
        "content": "{\n \"topic_modeling\": \"Software Comparison\",\n \"score_reason\": \"This prompt assesses the AI's factual accuracy in distinguishing between two similar but distinct software frameworks.\",\n \"score_value\": 8\n}",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    },
    {
      "finish_reason": "stop",
      "index": 2,
      "logprobs": null,
      "message": {
        "content": "{\n \"topic_modeling\": \"Comparison, Technology\",\n \"score_reason\": \"This prompt requires the AI to demonstrate knowledge of two different technologies, compare their features, and explain their distinctions. This task assesses the AI's factual accuracy and proficiency in understanding complex technological concepts.\",\n \"score_value\": 9\n}",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "topic_modeling_1": "Technical Comparison",
  "score_reason_1": "This prompt requires the AI to accurately compare and contrast two distinct technologies, OpenCL and CUDA. It assesses the AI's factual accuracy and knowledge of these technologies, as well as its ability to articulate the differences between them.",
  "score_value_1": 9,
  "topic_modeling_2": "Software Comparison",
  "score_reason_2": "This prompt assesses the AI's factual accuracy in distinguishing between two similar but distinct software frameworks.",
  "score_value_2": 8,
  "topic_modeling_3": "Comparison, Technology",
  "score_reason_3": "This prompt requires the AI to demonstrate knowledge of two different technologies, compare their features, and explain their distinctions. This task assesses the AI's factual accuracy and proficiency in understanding complex technological concepts.",
  "score_value_3": 9
}
```
</details>
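
The flat `topic_modeling_*`, `score_reason_*`, and `score_value_*` columns were extracted from the nested responses shown above. If you want to re-derive them (or pull out other fields), here is a minimal sketch of the idea, assuming each `message.content` is a JSON string like the one in the example:

```python
import json

def flatten_choices(record):
    """Expand the three nested GPT responses of one record into flat fields."""
    flat = {}
    for i, choice in enumerate(record["openai_scores_raw_choices_nested"], start=1):
        parsed = json.loads(choice["message"]["content"])
        flat[f"topic_modeling_{i}"] = parsed["topic_modeling"]
        flat[f"score_reason_{i}"] = parsed["score_reason"]
        flat[f"score_value_{i}"] = parsed["score_value"]
    return flat

# Example: flatten the first labeled record loaded earlier.
print(flatten_choices(scores.iloc[0]))
```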

<details>
<summary>We used the following prompt to generate the responses:</summary>


```markdown
We are interested in understanding how well the following input prompts can evaluate an
AI assistant’s proficiency in problem-solving ability, creativity, or adherence to real-world
facts. Your task is to assess each prompt based on its potential to gauge the AI’s capabilities
effectively in these areas.

For each prompt, carry out the following steps:

1. Topic Modeling: Use two words to describe the task intended.
2. Assess the Potential: Consider how challenging the prompt is, and how well it can
assess an AI’s problem-solving skills, creativity, or factual accuracy. Briefly explain your
reasoning.
3. Assign a Score: Assign a score on a scale of 1 to 10, with a higher score representing
a higher potential to evaluate the AI assistant’s proficiency effectively. Use double square
brackets to format your scores, like so: [[5]].

Guidelines for Scoring:
• High Score (8-10): Reserved for prompts that are particularly challenging and excellently designed to assess AI proficiency.
• Medium Score (4-7): Given to prompts that have a moderate potential to assess the AI’s
capabilities.
• Low Score (1-3): Allocated to prompts that are either too easy, ambiguous, or do not
adequately assess the AI’s capabilities.

Ensure to critically evaluate each prompt and avoid giving high scores to prompts that are
ambiguous or too straightforward.

The output MUST follow a JSON format:
{
  "topic_modeling": "...",
  "score_reason": "...",
  "score_value": ...
}
```
</details>
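
If you want to generate your own labels for new prompts with the same rubric, here is a minimal sketch of one possible setup. The model name, `n=3`, and the JSON parsing are our assumptions, not the exact pipeline used to produce the provided labels.

```python
import json
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def score_prompt(grading_instructions: str, prompt_text: str) -> list[dict]:
    """Ask for three independent hardness assessments of a single prompt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        n=3,  # three responses per prompt, matching the provided labels
        messages=[
            {"role": "system", "content": grading_instructions},
            {"role": "user", "content": prompt_text},
        ],
    )
    # Each reply is expected to be the JSON object described in the prompt above.
    return [json.loads(choice.message.content) for choice in response.choices]
```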

<details>
<summary>You can generate your own embeddings</summary>

Note that the provided embeddings were generated using OpenAI's model, which means that if you want to embed new prompts, you need an OpenAI developer account ($5 of free credit for new signups; otherwise a credit card is required). The code used to generate the embeddings is as follows:

```python
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
# `prompt` is the question text to embed.
response = client.embeddings.create(
    model="text-embedding-3-small", input=prompt, extra_body=dict(dimensions=256)
)
embedding = response.data[0].embedding  # a list of 256 floats
```

You are welcome to use open-source models (you can use Hugging Face on DataHub!) to generate the embeddings for the new prompts. A good starting point is the Hugging Face feature-extraction pipeline:

```python
from transformers import pipeline

embedder = pipeline("feature-extraction", model="distilbert-base-uncased")
# Returns one 768-dimensional vector per token, nested as [1][num_tokens][768].
embeddings = embedder(prompt)
# Mean-pool over tokens to obtain a single 768-dimensional vector for the prompt.
prompt_embedding = [sum(dim) / len(dim) for dim in zip(*embeddings[0])]
```

You can read more in the [feature extraction pipeline documentation](https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/pipelines#transformers.FeatureExtractionPipeline). With `distilbert-base-uncased`, each prompt ends up as a 768-dimensional embedding.
We used [this prompt](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdfe719e6e#file-prompt-md) to generate the responses. You are welcome to generate your own ground-truth data. You can generate your own embeddings by following the [guide here](https://gist.github.com/simon-mo/25c5d532bccc7f28b404cffdfe719e6e#file-using-your-own-embeddings-md).

</details>

#### Project Tasks

@@ -352,7 +201,7 @@ For the EDA task, we expect plots and storytelling. For modeling tasks, we expect y

#### Getting Started

To get started, we provide a notebook `nlp-chatbot-starter.ipynb` that demonstrates how to load and inspect the data. The project is expected to be open-ended!
To get started, we provide a notebook `nlp-chatbot-starter.ipynb` that demonstrates how to load and inspect the data.

#### Resources

