Getting Started with Synthetic Data Generation Powered by NeMo Curator #
In the following notebook, we’ll be exploring all of the amazing out-of-the-box functionality of the NeMo Curator Synthetic Data Generation (SDG) tooling.
First, we’ll work through an example of a pipeline in a piece-wise fashion - spending time exploring exactly how flexible NeMo Curator’s SDG functionality is. Then, we’ll explore all of the built-in pipelines for generating synthetic data for a number of different tasks.
In order to get started, though, we’ll need to install NeMo Curator!
NOTE: Please ensure you meet the requirements before proceeding!
Installing NeMo Curator Dependencies #
We’ll install NeMo Curator from source! First, let’s git clone
the repository.
!git clone https://github.com/NVIDIA/NeMo-Curator.git
%cd NeMo-Curator
[Cloning into 'NeMo-Curator'... remote: Enumerating objects: 2051, done. remote: Counting objects: 100% (1512/1512), done. remote: Compressing objects: 100% (837/837), done. remote: Total 2051 (delta 983), reused 1002 (delta 666), pack-reused 539 (from 1) Receiving objects: 100% (2051/2051), 2.28 MiB | 15.29 MiB/s, done. Resolving deltas: 100% (1236/1236), done. /home/chris/Code/NVIDIA/NeMo-Curator/tutorials/synthetic-data-hello-world/NeMo-Curator ]
!pip install -qU wheel
!pip install -qU .
Using the NeMo Curator OpenAI Client #
To ensure compatibility within the NeMo Curator SDG tooling, we’re going to use a specialized OpenAI Client. This is based on the OpenAI Python API library - but with a few modifications to allow seamless use for Synthetic Data Generation.
NOTE: While we’re going to be relying on the
build.nvidia.com
API endpoints for this example notebook, you can use this same flow with a model deployed as an NVIDIA NIM for LLMs which can be found here.
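For instance, here is a minimal, hypothetical sketch of pointing the same OpenAI client at a self-hosted NIM instead of build.nvidia.com - the URL, port, and API key below are placeholders for whatever your own deployment uses:
from openai import OpenAI
# Hypothetical local NIM deployment - replace the base_url (and api_key, if you've configured one)
# with the values for your own endpoint.
local_nim_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used-for-local-deployments",
)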
You’ll need to make sure you have an NVIDIA API key - which you can obtain by following this process:
- Login (or sign up) through build.nvidia.com.
- Click the Get API Key button available on the nvidia/nemotron-4-340b-instruct page, found here.
import os
import getpass
os.environ["NVIDIA_API_KEY"] = getpass.getpass("Please provide your API Catalogue NVIDIA API Key:")
from openai import OpenAI
openai_client = OpenAI(
base_url="https://integrate.api.nvidia.com/v1",
api_key=os.environ["NVIDIA_API_KEY"],
)
Now we can wrap this client in the NeMo Curator OpenAIClient!
from nemo_curator import OpenAIClient
curator_openai_client = OpenAIClient(openai_client)
Chat Model Usage #
Now we can look at how to use our NeMo Curator OpenAIClient
to generate a response.
As you can see - the structure of the request is very close to the traditional OpenAI client!
responses = curator_openai_client.query_model(
model="nvidia/nemotron-4-340b-instruct",
messages=[
{
"role": "user",
"content": "Write a limerick about the wonders of GPU computing.",
}
],
temperature=0.2,
top_p=0.7,
max_tokens=1024,
)
print(responses[0])
[In the realm of computing, where data's the king, GPU power makes everything sing. Parallel processing, so neat, Makes complex tasks a treat, A wonder of tech, it's truly a thing! With thousands of cores, in silicon etched, Through machine learning, they're well-matched. Nvidia, AMD, in the race, To accelerate every place, GPU computing, a marvel, is hatched. From gaming to AI, and scientific research, GPUs help us leap, not just lurch. So here's to the engineers, so bright, Who brought us this marvel, pure delight, GPU computing, a true gem, we search! ]
Reward Model Usage #
We can use the same client to query NVIDIA’s best-in-class Reward Model - Nemotron-4 340B Reward using the query_reward_model
method of our OpenAIClient
.
model = "nvidia/nemotron-4-340b-reward"
The query_reward_model method expects a conversation between a User and an Assistant.
messages = [
{
"role": "user",
"content": "I am going to Paris, what should I see?"
},
{
"role": "assistant",
"content": "Ah, Paris, the City of Light! There are so many amazing things to see and do in this beautiful city...",
},
]
rewards = curator_openai_client.query_reward_model(messages=messages, model=model)
print(rewards)
[{'helpfulness': 1.4765625, 'correctness': 1.6171875, 'coherence': 3.21875, 'complexity': 0.640625, 'verbosity': 0.365234375} ]
The Nemotron-4 340B Reward model will provide us the scores (between 0 and 4) for each of the 5 SteerLM attributes:
- Helpfulness: Overall helpfulness of the response to the prompt.
- Correctness: Inclusion of all pertinent facts without errors.
- Coherence: Consistency and clarity of expression.
- Complexity: Intellectual depth required to write response (i.e. whether the response can be written by anyone with basic language competency or requires deep domain expertise).
- Verbosity: Amount of detail included in the response, relative to what is asked for in the prompt.
These can be used as a filter for any of the individual attributes, or utilized to verify specific attributes.
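As a minimal sketch of the filtering idea - assuming query_reward_model returns a dictionary of attribute scores like the one printed above, and with the 3.0 helpfulness threshold and the candidate_conversations name being arbitrary choices for illustration:
# Hypothetical filter: keep only conversations whose assistant response scores
# above a helpfulness threshold according to the reward model.
HELPFULNESS_THRESHOLD = 3.0  # arbitrary cutoff chosen for illustration

def filter_by_helpfulness(candidate_conversations, threshold=HELPFULNESS_THRESHOLD):
    kept = []
    for conversation in candidate_conversations:
        scores = curator_openai_client.query_reward_model(
            messages=conversation,
            model="nvidia/nemotron-4-340b-reward",
        )
        if scores["helpfulness"] >= threshold:
            kept.append(conversation)
    return kept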
Using The NemotronGenerator
#
The NeMo Curator Synthetic Data Generation (SDG) features are primarily accessed through the NemotronGenerator
class.
This useful wrapper helps expose both:
- Pre-built SDG pipelines
- A number of specific generation utilities, which we’ll explore in the following section of the notebook.
from nemo_curator.synthetic import NemotronGenerator
generator = NemotronGenerator(curator_openai_client)
If you’d like to skip forward to a specific pipeline, you can find them here:
- Math Question Generation Pipeline
- Writing Task Generation Pipeline
- Open Question Generation Pipeline
- Closed Question Generation Pipeline
- Python Question Generation Pipeline
- Dialogue Generation Pipeline
- Two-Turn Prompt Generation Pipeline
- Entity Classification
Exploring the Math Question Generation Pipeline #
Before heading into the pre-built pipelines, we’re going to “break apart” an existing pipeline - in this case, the Math Question Generation Pipeline - and see the granular customization that NeMo Curator provides for each step.
We’re going to work through the following process, which is detailed in the Nemotron-4 340B Technical Report:
- Generate n Macro Topics - Have our LLM generate n broad topics relating to daily life, the world, etc.
- Generate n Sub Topics - Have our LLM take each Macro Topic and generate n topics relating to the Macro Topic.
- Generate n Questions - Have our LLM take each subtopic and generate n questions related to that topic (at the desired level).
Let’s dive in!
Model Selection and Configs #
First, we’ll emulate the process as outlined by the Nemotron-4 340B Technical Report by selecting the Mixtral-8x7B-Instruct-v0.1
model, as well as some reasonable generation parameters.
model = "mistralai/mixtral-8x7b-instruct-v0.1"
model_kwargs = {
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 1024,
}
Generating n
Macro Topics
#
Our first step is to generate our Macro Topics.
Let’s look at the prompt that drives this process as well, to get a better understanding of what’s happening “under the hood”:
"Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n"
To do this, we’ll use the generate_macro_topics
method of our NemotronGenerator
.
NOTE: All prompt templates are fully customizable, and we’ll take a look at how we can do that in the upcoming cells!
# define the number of macro topics to generate
n_macro_topics = 20
# generate macro topics
responses = generator.generate_macro_topics(
n_macro_topics=n_macro_topics,
model=model,
model_kwargs=model_kwargs
)
print(responses[0])
[1. Climate Change and Environmental Impact 2. Mental Health and Well-being 3. Space Exploration and Astronomy 4. Global Health and Pandemics 5. Renewable Energy and Sustainable Living 6. Artificial Intelligence and Machine Learning 7. Biodiversity and Conservation 8. Virtual Reality and Gaming 9. Nutrition and Diet 10. Social Media and Online Communication 11. Genetics and Genetic Engineering 12. E-commerce and Online Shopping 13. Neuroscience and Human Behavior 14. Disaster Preparedness and Response 15. Quantum Computing and Cryptography 16. Education and Lifelong Learning 17. Cybersecurity and Data Privacy 18. Biotechnology and Synthetic Biology 19. Transportation and Urban Planning 20. Human Rights and Social Justice. ]
While this is a great start - we’d love to have this response in a Python list.
Luckily for us, NeMo Curator has just the tool!
We’ll use the convert_response_to_yaml_list
method to accomplish this goal.
NOTE: Currently, this method is quite strict - so custom parsing might be required depending on model choice and use case.
from nemo_curator.synthetic.error import YamlConversionError
while True:
try:
topic_list = generator.convert_response_to_yaml_list(
responses[0], model=model, model_kwargs=model_kwargs
)
break
except YamlConversionError as e:
print(f"Hit: {e}, Retrying...")
responses = generator.generate_macro_topics(
n_macro_topics=n_macro_topics,
model=model,
model_kwargs=model_kwargs
)
print(topic_list[0])
[Climate Change and Environmental Impact ]
Generating n
subtopics
#
We’ll proceed through the same process as we did above, but this time our prompt will reflect our desire to generate subtopics.
Let’s check it out:
"Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible."
As before, we’ll use the generate_subtopics
method to fire off this subtask.
# number of subtopics to generate
n_subtopics = 5
# generate subtopics
subtopic_responses = generator.generate_subtopics(
macro_topic=topic_list[0], n_subtopics=n_subtopics, model=model
)
print(subtopic_responses[0])
[1. "Global Warming and the Role of Greenhouse Gases": This topic could cover the science behind global warming, the impact of human activities (such as burning fossil fuels) on greenhouse gas emissions, and potential solutions to reduce our carbon footprint. 2. "Impact of Deforestation on Biodiversity and Climate Change": This topic could explore the importance of forests in maintaining the planet's biodiversity, their role in carbon sequestration, and the devastating effects of deforestation on both. 3. "Climate Change and Ocean Acidification": This topic could delve into how increased carbon dioxide levels in the atmosphere are leading to ocean acidification, its impact on marine life, and potential consequences for the food chain and human societies. 4. "Renewable Energy Sources and Sustainable Future": This topic could examine various types of renewable energy (solar, wind, hydro, etc.), their advantages and challenges, and how they can help mitigate climate change while ensuring a sustainable future. 5. "Climate Change Mitigation and Adaptation Strategies": This topic could discuss different strategies to combat climate change, ranging from reducing emissions (mitigation) to coping with its impacts (adaptation). It could also look at international agreements like the Paris Agreement and national policies aimed at addressing climate change. ]
We can use the convert_response_to_yaml_list
method to clean this up!
while True:
try:
subtopic_list = generator.convert_response_to_yaml_list(
subtopic_responses[0], model=model, model_kwargs=model_kwargs
)
break
except YamlConversionError as e:
print(f"Hit: {e}, Retrying...")
subtopic_responses = generator.generate_subtopics(
macro_topic=topic_list[0], n_subtopics=n_subtopics, model=model
)
subtopic_list
[['Global Warming and the Role of Greenhouse Gases', 'Impact of Deforestation on Biodiversity and Climate Change', 'Climate Change and Ocean Acidification', 'Renewable Energy Sources and Sustainable Future', 'Climate Change Mitigation and Adaptation Strategies']]
Generating a Math problem #
We can now generate Math problems based on the generated topics/subtopics.
We can look at the default prompt to see how these questions are generated as we did with the other stages:
'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.'
Generating the Math problems is as easy as utilizing the generate_math_problem
method.
question_responses = generator.generate_math_problem(
topic=subtopic_list[0],
n_openlines=10,
model=model
)
Once again, we’ll convert the response to a Python list with the convert_response_to_yaml_list
method of our generator!
while True:
try:
question_list = generator.convert_response_to_yaml_list(
question_responses[0], model=model, model_kwargs=model_kwargs
)
break
except YamlConversionError as e:
print(f"Hit: {e}, Retrying with fewer examples...")
question_responses = generator.generate_math_problem(
topic=subtopic_list[0],
n_openlines=5,
model=model
)
question_list
[Hit: Conversion introduced hallucinations. Original response: 1. If the current rate of carbon dioxide emissions is 50 billion tons per year and the concentration of CO2 in the atmosphere is currently 400 parts per million (ppm), assuming no removal or absorption, how many years will it take for the CO2 concentration to reach 500 ppm? 2. The Earth absorbs 24% of the solar energy it receives, while the rest is reflected back into space. If greenhouse gases cause the Earth to retain an additional 0.3% of the solar energy, what is the total percentage of solar energy that the Earth now retains? 3. If a factory releases 10,000 tons of CO2 per year and can be converted to use renewable energy, which would reduce its emissions to zero, how much will the global CO2 concentration decrease if the factory's emissions are completely eliminated after 10 years? 4. The greenhouse effect is responsible for trapping 0.03% of the total solar energy that reaches the Earth's surface. If the concentration of greenhouse gases in the atmosphere increases by 50%, how much more solar energy will be trapped, assuming a linear relationship? 5. Assume that the current global temperature increase due to greenhouse gas emissions is 0.01°C per year. If the total greenhouse gas emissions were to be reduced by 25% in the next 10 years, by how much would the temperature increase be reduced over the following 10 years? 6. Given that the average American produces 16.5 metric tons of CO2 per year, what percentage reduction in CO2 emissions would be needed to achieve the goal of keeping the global temperature increase below 1.5°C above pre-industrial levels, assuming all other factors remain constant? 7. If the global methane concentration is currently 1.8 parts per billion (ppb) and increases by 0.02 ppb per year due to human activities, how many years will it take for the methane concentration to reach 2.0 ppb, assuming no removal or absorption? 8. Assume that the current rate of deforestation releases 2.4 billion tons of CO2 per year. If all deforestation were stopped immediately, how much would the global CO2 concentration decrease after 50 years, assuming no other changes in emissions? 9. If a new technology can capture and store 90% of the CO2 emissions from a power plant, and the power plant emits 10,000 tons of CO2 per year, how much CO2 would be released into the atmosphere each year with the new technology? 10. Given that the global average temperature has increased by approximately 1°C since the pre-industrial era, and that this temperature increase is due to a 40% increase in the concentration of greenhouse gases, estimate the global average temperature increase if the concentration of greenhouse gases were to double. Converted response: ['50 billion tons per year', '24.3%', '10,000 tons per year', '0.015%', '0.005°C per year', '66.25%', '55.56 years', '1.2 billion tons of CO2', '1,000 tons of CO2 per year', '2°C'] Hallucination: 24.3%, Retrying with fewer examples... ]
[['Carbon Footprint Calculation', 'Greenhouse Gas Concentration Trends', 'Global Temperature Change Estimation', 'Absorption of Solar Radiation', "Climate Modeling a City's Temperature Increase"]]
Modifying the Prompts #
NeMo Curator gives us granular control over the prompts at every step of each pipeline - let’s look at how we can modify them!
We’ll start with a simple example of modifying the prompt to another provided default.
You can find all available pre-constructed prompts here.
NOTE: When constructing new prompts, you need to include the same placeholders ({topic}, {n_openlines}, etc.) to ensure smooth integration with NemotronGenerator
Using Alternative Prompts #
Let’s examine the MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
:
'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.'
Replacing our existing prompt template with this new one is as easy as passing it as the prompt_template
parameter of the generate_math_problem
method.
from nemo_curator.synthetic import MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
easy_question_responses = generator.generate_math_problem(
topic=subtopic_list[1],
n_openlines=10,
model=model,
prompt_template=MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
)
While we have the convert_response_to_yaml_list
method to help us, we can also produce custom parsing functions if required.
easy_question_responses[0]
["1. If a forest covering 10,000 square kilometers is cut down, approximately how many trees are lost? (Assuming an average of 500 trees per hectare and 1 hectare = 0.01 square kilometers)\n2. If deforestation continues at the current rate, how many years will it take for the world's rainforests to disappear completely? (Assuming the current rate is 150,000 square kilometers per year and the total area of rainforests is 11,500,000 square kilometers)\n3. If the average temperature increases by 0.2°C for every 1% decrease in forest cover, what will be the increase in temperature if 2% of the forest cover is lost?\n4. If a country has a carbon footprint of 500 million tons per year and decides to reduce it by planting trees that absorb 10,000 tons of carbon dioxide per square kilometer per year, how many square kilometers of forest would need to be planted to offset the entire carbon footprint?\n5. If a forest provides habitat for 400 species of birds and 30% of those species are threatened by deforestation, how many bird species are at risk?\n6. If a logging company harvests trees from a 500-hectare forest every 20 years, what is the annual deforestation rate?\n7. If 100,000 tons of carbon dioxide are released into the atmosphere each day due to deforestation, how many tons of carbon dioxide are released in a year (assuming 365 days in a year)?\n8. If 20% of the Amazon rainforest has been destroyed and the Amazon holds 400 billion tons of carbon, how much carbon has been released into the atmosphere due to deforestation?\n9. If a forest serves as a watershed for a city of 1 million people, what is the impact on the city's water supply if 50% of the forest is lost?\n10. If planting trees can increase biodiversity, and a tree species has 500 seeds per kilogram and each seed grows into a new tree, how many new trees can be created from 100 kilograms of seeds?"]
To do this, we’ll take the str
response, split it into lines, and then remove the leading 1.
, 2.
, etc.
import re
def parse_math_problem_response(response):
    # Split the raw string response into lines and strip the leading "1. ", "2. ", etc.
    response = response.split("\n")
    return [re.sub(r"^\d+\.\s", "", line) for line in response]
easy_question_list = parse_math_problem_response(easy_question_responses[0])
easy_question_list
[['If a forest covering 10,000 square kilometers is cut down, approximately how many trees are lost? (Assuming an average of 500 trees per hectare and 1 hectare = 0.01 square kilometers)', "If deforestation continues at the current rate, how many years will it take for the world's rainforests to disappear completely? (Assuming the current rate is 150,000 square kilometers per year and the total area of rainforests is 11,500,000 square kilometers)", 'If the average temperature increases by 0.2°C for every 1% decrease in forest cover, what will be the increase in temperature if 2% of the forest cover is lost?', 'If a country has a carbon footprint of 500 million tons per year and decides to reduce it by planting trees that absorb 10,000 tons of carbon dioxide per square kilometer per year, how many square kilometers of forest would need to be planted to offset the entire carbon footprint?', 'If a forest provides habitat for 400 species of birds and 30% of those species are threatened by deforestation, how many bird species are at risk?', 'If a logging company harvests trees from a 500-hectare forest every 20 years, what is the annual deforestation rate?', 'If 100,000 tons of carbon dioxide are released into the atmosphere each day due to deforestation, how many tons of carbon dioxide are released in a year (assuming 365 days in a year)?', 'If 20% of the Amazon rainforest has been destroyed and the Amazon holds 400 billion tons of carbon, how much carbon has been released into the atmosphere due to deforestation?', "If a forest serves as a watershed for a city of 1 million people, what is the impact on the city's water supply if 50% of the forest is lost?", 'If planting trees can increase biodiversity, and a tree species has 500 seeds per kilogram and each seed grows into a new tree, how many new trees can be created from 100 kilograms of seeds?']]
Creating Custom Prompts #
We can also define our own custom prompts - again, making sure that the placeholder variables are consistent between the original prompt and the newly constructed prompt.
DIFFICULT_MATH_PROMPT = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". These problems should be extremely advanced and only solvable by experts who have spent many years learning "{topic}". Your answer should be a list of problems, do not name the problems. Make them as diverse as possible.'
# generate difficult math problems
difficult_question_responses = generator.generate_math_problem(
topic=subtopic_list[1],
n_openlines=10,
model=model,
prompt_template=DIFFICULT_MATH_PROMPT
)
while True:
try:
difficult_question_list = generator.convert_response_to_yaml_list(
difficult_question_responses[0], model=model, model_kwargs=model_kwargs
)
break
except YamlConversionError as e:
print(f"Hit: {e}, Retrying with fewer examples...")
difficult_question_responses = generator.generate_math_problem(
topic=subtopic_list[1],
n_openlines=5,
model=model,
prompt_template=DIFFICULT_MATH_PROMPT
)
difficult_question_list
[['Develop a mathematical model to quantify the relationship between deforestation, carbon sequestration, and the global carbon budget, taking into account the impacts on biodiversity and climate change', 'Analyze the impact of various deforestation scenarios on species diversity and extinction rates using advanced mathematical techniques such as population dynamics models and biodiversity indices', 'Create a complex mathematical function to estimate the change in surface temperature and precipitation patterns as a result of deforestation-induced climate change', 'Utilize statistical methods to assess the correlation between deforestation rates and changes in local and regional climate patterns, controlling for other factors such as land use and anthropogenic emissions', 'Develop a mathematical framework to evaluate the optimal balance between deforestation for land use and the preservation of biodiversity and climate stability', 'Use mathematical modeling to predict the long-term impacts of deforestation on the global carbon cycle and the feedback loops between carbon sinks and sources', 'Utilize probability theory and stochastic processes to model the variability and uncertainty in the relationship between deforestation, biodiversity, and climate change', 'Develop a mathematical method for integrating the impacts of deforestation on biodiversity and climate change into economic cost-benefit analyses and decision-making frameworks', 'Use optimization algorithms to identify the most effective strategies for reducing deforestation and mitigating its impacts on biodiversity and climate change', 'Use differential equations to model the complex interactions between deforestation, biodiversity loss, and climate change, and develop strategies for managing these systems in a sustainable manner']]
Async OpenAI Client Usage #
Now that we’ve explored a single deconstructed pipeline, we’ll work through a number of fantastic built-in pipelines that can be used for a variety of tasks.
Before doing that, however, we’ll instantiate an asynchronous client and generator to allow us to generate responses more efficiently!
from openai import AsyncOpenAI
from nemo_curator import AsyncOpenAIClient
from nemo_curator.synthetic import AsyncNemotronGenerator
openai_client = AsyncOpenAI(
base_url="https://integrate.api.nvidia.com/v1", api_key=os.environ["NVIDIA_API_KEY"]
)
client = AsyncOpenAIClient(openai_client)
generator = AsyncNemotronGenerator(client, max_concurrent_requests=10)
model = "nvidia/nemotron-4-340b-instruct"
Built-In SDG Pipelines #
- Math Question Generation Pipeline
- Writing Task Generation Pipeline
- Open Question Generation Pipeline
- Closed Question Generation Pipeline
- Python Question Generation Pipeline
- Dialogue Generation Pipeline
- Two-Turn Prompt Generation Pipeline
- Entity Classification
A Note on ignore_conversion_failure=True
#
Due to the variety of models and prompts, conversion from a str
output to a Python list
will not always be successful. For this reason, it is currently suggested to set ignore_conversion_failure=True
to avoid the pipeline breaking down during generation.
This will impact the total number of generated entities.
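If you would rather surface conversion failures than silently skip those entries, a hedged alternative - assuming the pipelines raise the same YamlConversionError we imported earlier when the flag is left off - is to handle the error yourself, here using the math pipeline from the next section as an example:
# Hypothetical sketch: let conversion failures raise instead of being ignored,
# and handle them explicitly. Assumes YamlConversionError propagates from the pipeline.
try:
    math_questions = await generator.run_math_pipeline(
        n_macro_topics=5,
        school_level="university",
        n_subtopics=5,
        n_openlines=10,
        model=model,
        ignore_conversion_failure=False,
    )
except YamlConversionError as e:
    print(f"Conversion failed: {e}")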
Math Question Generation Pipeline #
The run_math_pipeline
can be used to generate Math questions at various school levels.
NOTE: The
school_level
parameter will influence the generation of Macro Topics.
math_questions = await generator.run_math_pipeline(
n_macro_topics=5,
school_level="university",
n_subtopics=5,
n_openlines=10,
model=model,
ignore_conversion_failure=True
)
print(math_questions[0])
[Let (X,Σ,μ) be a measure space and let f:X→ℝ be a Σ-measurable function. Prove that the set {x∈X:f(x)≥t} is Σ-measurable for all t∈ℝ. ]
Writing Task Generation Pipeline #
The run_writing_pipeline
can be used to generate various forms of writing tasks based on provided topics.
NOTE: You could use a topic generation pipeline to generate the seed topics for this pipeline.
writing_tasks = await generator.run_writing_pipeline(
topics=[
"Climate Change and Sustainable Living",
"Space Exploration and the Universe",
],
text_material_types=["Poems", "Essays"],
n_openlines=5,
n_revisions=2,
model=model,
ignore_conversion_failure=True
)
[100%|██████████| 4/4 [00:30<00:00, 7.71s/it] 100%|██████████| 1/1 [00:30<00:00, 30.83s/it] 100%|██████████| 10/10 [01:03<00:00, 6.30s/it] 100%|██████████| 10/10 [00:56<00:00, 5.63s/it] 100%|██████████| 2/2 [01:59<00:00, 59.68s/it] ]
writing_tasks[:5]
[['Compose a 14-line sonnet in iambic pentameter, praising the beauty and importance of wind and solar power in sustainable living. The sonnet must include at least two concrete examples of how these renewable energy sources contribute to reducing carbon emissions and a reference to their growing global capacity. (Requirement 1, 2, 3)', 'Create a sonnet that highlights the role of wind and solar power in combating climate change, using vivid imagery to describe their functionality and aesthetics. The poem should incorporate at least one data point about the increasing affordability of these technologies and mention a specific region or country that has made significant strides in renewable energy adoption. The sonnet must adhere to the traditional rhyme scheme and contain no more than 150 words. (Requirement 1, 2, 3, 4)', 'Write a 500-word essay discussing the impact of climate change on global food security, focusing on how rising temperatures and shifting precipitation patterns affect crop yields and agricultural productivity. Include data from recent studies and provide examples of regions most vulnerable to these changes. Additionally, propose three sustainable agricultural practices that can help mitigate this issue, such as agroforestry, conservation agriculture, and precision farming, and explain their benefits.', 'In a 3-page, APA-style essay, analyze the relationship between climate change and food insecurity, emphasizing the challenges faced by smallholder farmers in developing countries. Use at least three scholarly sources to support your discussion. Furthermore, recommend two policy measures and one technological solution to promote climate-resilient agriculture, ensuring to address potential barriers to implementation and adoption in your analysis.', 'Write a 300-word essay comparing the efficiency and environmental impact of solar panels and wind turbines in reducing carbon emissions, including real-world examples of their implementation and data on their energy production capabilities. Additionally, discuss the challenges and benefits of integrating these renewable energy sources into existing power grids.']]
Open Question Pipeline #
The run_open_qa_pipeline
can be used to create open questions about desired topics and subtopics.
Prompt Modification at Pipeline Level #
You can freely adjust the prompts, even at the pipeline level!
# define new open QA prompt
NEW_OPEN_QA_PROMPT = """\
Can you generate {n_openlines} questions or requests related to {topic}? The questions should build off each other. Your answer should be a list.
"""
# run open QA pipeline
open_qa_questions = await generator.run_open_qa_pipeline(
n_macro_topics=1,
n_subtopics=2,
n_openlines=5,
n_revisions=2,
model=model,
open_qa_from_topics_prompt_template=NEW_OPEN_QA_PROMPT, # substitute the default prompt with the new prompt
ignore_conversion_failure=True
)
[100%|██████████| 10/10 [00:52<00:00, 5.27s/it] 100%|██████████| 1/1 [00:52<00:00, 52.70s/it] 100%|██████████| 10/10 [01:25<00:00, 8.56s/it] 100%|██████████| 10/10 [01:33<00:00, 9.39s/it] 100%|██████████| 4/4 [00:58<00:00, 14.55s/it] 100%|██████████| 3/3 [03:57<00:00, 79.24s/it] 100%|██████████| 10/10 [01:24<00:00, 8.42s/it] 100%|██████████| 10/10 [01:13<00:00, 7.40s/it] 100%|██████████| 10/10 [00:56<00:00, 5.68s/it] 100%|██████████| 10/10 [01:01<00:00, 6.11s/it] 100%|██████████| 10/10 [00:48<00:00, 4.89s/it] 100%|██████████| 10/10 [01:12<00:00, 7.22s/it] 100%|██████████| 10/10 [00:51<00:00, 5.19s/it] 100%|██████████| 10/10 [00:48<00:00, 4.83s/it] 100%|██████████| 10/10 [01:04<00:00, 6.50s/it] 100%|██████████| 10/10 [00:53<00:00, 5.35s/it] 100%|██████████| 10/10 [00:53<00:00, 5.35s/it] 100%|██████████| 10/10 [00:49<00:00, 4.94s/it] 100%|██████████| 12/12 [11:58<00:00, 59.89s/it] ]
open_qa_questions[0]
['Given the urgent need to reduce carbon emissions and promote sustainable urban development, could you provide examples of cities or regions that have successfully implemented green transportation technologies, such as electric buses or bike-sharing systems? In your response, please detail the specific infrastructure changes these cities made to support these innovative solutions, like installing charging stations or creating dedicated bike lanes.']
Closed Question Pipeline #
You can use the run_closed_qa_pipeline
to generate questions specific to a provided context.
blog_text = """\
NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across healthcare, finance, manufacturing, retail and every other industry.
High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM — but robust datasets can be prohibitively expensive and difficult to access.
Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.
The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization and evaluation. They’re also optimized for inference with the open-source NVIDIA TensorRT-LLM library.
Nemotron-4 340B can be downloaded now from the NVIDIA NGC catalog and from Hugging Face, where developers can also use the Train on DGX Cloud service to easily fine-tune open AI models. Developers will soon be able to access the models at ai.nvidia.com, where they’ll be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.
"""
The output of the pipeline is in tuple format:
[
(0, "Sample Question About Document at Index 0"),
...,
(1, "Sample Question About Document at Index 1"),
...,
(2, "Sample Question About Document at Index 2")
]
where the first element of each tuple is the index of the document that the question (the second element) pertains to.
closed_qa_questions = await generator.run_closed_qa_pipeline(
documents=[blog_text], # pass the blog text as a list
n_openlines=10,
model=model,
ignore_conversion_failure=True
)
closed_qa_questions
[100%|██████████| 1/1 [01:14<00:00, 74.29s/it] 100%|██████████| 1/1 [01:14<00:00, 74.29s/it] ]
[[(0, "Can you summarize the main purpose of NVIDIA's newly announced Nemotron-4 340B in one sentence?"), (0, 'Explain how the Nemotron-4 340B family of models can help developers create custom LLMs for various industries.'), (0, 'Write a short paragraph about the significance of high-quality training data in the development of LLMs and how Nemotron-4 340B addresses this challenge.'), (0, 'How does NVIDIA NeMo and TensorRT-LLM library relate to Nemotron-4 340B, and what benefits do they provide to developers?'), (0, 'Create a tweet announcing the release of Nemotron-4 340B, highlighting its key features and benefits for developers.'), (0, 'Identify the three types of models included in the Nemotron-4 340B family and explain their roles in the synthetic data generation pipeline.'), (0, 'Rephrase the section about the availability of Nemotron-4 340B, focusing on the various platforms where developers can access and utilize the models.'), (0, 'Compare and contrast the process of training LLMs with and without the use of synthetic data generated by Nemotron-4 340B.'), (0, 'Write a brief blog post introduction discussing the importance of open models like Nemotron-4 340B in democratizing AI and fostering innovation.'), (0, 'Design a simple infographic that illustrates the workflow of using Nemotron-4 340B to generate synthetic data and train custom LLMs.')]]
Python Question Generation Pipeline #
The run_python_pipeline
can be used to generate questions pertaining to Python tasks.
python_questions = await generator.run_python_pipeline(
n_macro_topics=3,
n_subtopics=2,
n_openlines=10,
model=model,
ignore_conversion_failure=True
)
python_questions[:5]
[100%|██████████| 3/3 [01:26<00:00, 28.78s/it] 100%|██████████| 1/1 [01:26<00:00, 86.35s/it] 100%|██████████| 3/3 [02:02<00:00, 40.88s/it] 100%|██████████| 1/1 [02:02<00:00, 122.65s/it] ]
[['Write a Python program to create a stack data structure using a list and implement the push, pop, and is_empty methods.', 'Create a program that uses a queue data structure (implemented with a list) to manage a simple printing queue. The program should allow users to add print tasks to the queue, process the tasks one by one, and display the current queue status.', 'Implement a Python program that uses a dictionary to count the frequency of each word in a given string. The program should also print the words in descending order of their frequency.', 'Write a Python program to create a linked list data structure and implement the append, insert, and delete methods. Additionally, create a method to display the linked list elements.', 'Develop a Python program that uses a set data structure to find the union, intersection, and difference between two given lists of numbers.']]
Dialogue Generation Pipeline #
This pipeline will take existing openlines (generated, for example, by any of the above pipelines) and convert them into dialogues, or multi-turn conversations:
generate_dialogue
: This will generate n
turns between a User and an Assistant.
dialogue = await generator.generate_dialogue(
openline="Write a poem about the moon.",
user_model=model,
assistant_model=model,
n_user_turns=3,
)
dialogue
[[{'role': 'user', 'content': 'Write a poem about the moon.'}, {'role': 'assistant', 'content': "In the velvet night, a silver orb ascends,\nA quiet guardian, on whom the stars depend.\nThe moon, with her gentle, radiant grace,\nGuides us through the darkness, with her soft, lunar face.\n\nShe's a cosmic poet, whispering tales of old,\nOf ancient seas and craters, stories to be told.\nHer phases mark the passage of time, a celestial clock,\nFrom crescent to full, in her grand, cyclical walk.\n\nShe inspires the tides, with her gravitational pull,\nA dance between the earth and moon, beautiful and full.\nHer light, though borrowed, is never less,\nA beacon of hope, in the abyss of endless space.\n\nTo lovers, she's a confidant, bearing silent witness,\nTo promises made under her soft, ethereal kiss.\nTo dreamers, she's a muse, igniting sparks of creativity,\nA canvas for imagination, boundless and free.\n\nSo here's to the moon, our companion in the night,\nA symbol of mystery, bathed in soft, silver light.\nMay she continue to inspire, to guide, and to soothe,\nA timeless friend, in the grand theater of the universe's roof.\n"}, {'role': 'user', 'content': 'Could you now compose a poem about the sun, with a similar tone and length, showcasing its power, importance, and the life it brings?\n'}, {'role': 'assistant', 'content': "In the grand theater of the sky, a golden orb ascends,\nThe sun, with its fiery, life-giving power, on which all life depends.\nA blazing sovereign, ruling the azure dome,\nBringing warmth and light, to every earthly home.\n\nIt's a cosmic artist, painting sunsets and dawns,\nWith hues of crimson, orange, and gold, on horizons it adorns.\nIts rays, like fingers, reach through the clouds,\nAwakening the world, dispelling night's shrouds.\n\nThe sun is a giver, of energy and life,\nTo plants and creatures, in their daily strife.\nIts light, a catalyst, for growth and photosynthesis,\nA symphony of life, in every verdant abyss.\n\nTo farmers, it's a blessing, ripening fields of grain,\nTo sailors, a compass, guiding them through the main.\nTo children, it's a playmate, in games of shadow and light,\nA source of joy, in the warmth of its inviting sight.\n\nSo here's to the sun, our radiant, celestial king,\nA symbol of power, life, and the joy that it brings.\nMay it continue to radiate, to nourish, and to inspire,\nA timeless beacon, in the grand cosmic choir.\n\nYet, let us not forget, to respect its might,\nFor its fiery embrace, can be as harsh as it is bright.\nA delicate balance, between life and death,\nThe sun, a reminder, of our shared, planetary breath.\n"}, {'role': 'user', 'content': 'Could you now create a poem about the earth, connecting it with the moon and sun, emphasizing their interdependence and the delicate balance they maintain for life to flourish?\n'}, {'role': 'assistant', 'content': "In the cosmic dance, three celestial bodies entwine,\nThe Earth, the Moon, and Sun, in a rhythmic, harmonious line.\nA trinity of existence, each playing a vital role,\nIn the grand, intricate symphony, of life's eternal goal.\n\nThe Earth, a nurturing mother, teeming with life,\nA vibrant, blue oasis, amidst the cosmic strife.\nShe's a canvas for the sun's warm, golden touch,\nAnd the moon's soft, silver kiss, in the hush of night's clutch.\n\nThe sun, a fiery father, bestowing energy and light,\nGuiding the seasons, in their cyclical flight.\nHis rays, a lifeline, for the Earth's myriad forms,\nA dance of photosynthesis, in forests and farms.\n\nThe moon, a gentle 
sister, regulating tides and time,\nA silent guardian, in the grand, cosmic rhyme.\nHer phases, a metronome, for life's rhythmic beat,\nA soothing balm, in the night's serene retreat.\n\nTogether, they maintain a delicate balance, a celestial pact,\nEach giving and taking, in a mutual, harmonious act.\nThe Earth, the Sun, and Moon, in a waltz of interdependence,\nA testament to the beauty, of nature's grand alliance.\n\nYet, this balance is fragile, requiring our utmost care,\nFor the health of our Earth, is beyond compare.\nLet us honor this trinity, with respect and love,\nA sacred trust, in the vastness above.\n\nSo here's to the Earth, the Sun, and the Moon,\nA celestial family, in the grand, cosmic swoon.\nMay we cherish their bond, and the life it sustains,\nIn the grand, interconnected dance, of the universe's veins.\n"}]]
Two-Turn Prompt Generation Pipeline #
This pipeline will take, as input, a single openline - and it will return an Assistant response, as well as a synthetically generated User response. This can be used as a prompt for a two-turn Assistant response.
generate_two_turn_prompt
: This will generate: “User (Input) -> Assistant (Output) -> User (Output)”.
dialogue = await generator.generate_two_turn_prompt(
openline="Write a poem about the moon.",
user_model=model,
assistant_model=model,
)
dialogue
[[{'role': 'user', 'content': 'Write a poem about the moon.'}, {'role': 'assistant', 'content': "In the velvet night, a silver orb ascends,\nA quiet guardian, on whom the darkness depends.\nThe moon, with her gentle, ethereal glow,\nGuides us through shadows, in her soothing flow.\n\nCraters and valleys, a face etched in time,\nA silent witness to the cosmos' rhyme.\nHer phases a dance, from crescent to full,\nIn her rhythmic cycle, she pulls and she lull.\n\nShe whispers to lovers, under her soft light,\nInspiring poetry in the stillness of night.\nTo the tides she sings, a celestial song,\nIn her gravitational embrace, they belong.\n\nA beacon for sailors, lost in the sea,\nA companion to dreamers, in their reverie.\nHer light, though borrowed, is never less,\nA reflection of the sun, in her elegant dress.\n\nSo here's to the moon, our celestial guide,\nIn her we find peace, in her we confide.\nThrough the ebb and flow of life's vast sea,\nShe's a constant reminder, of the beauty that's free.\n"}, {'role': 'user', 'content': 'Could you now compose a poem about the sun, as a companion piece to the moon poem?\n'}]]
Entity Classification #
Aside from generating synthetic responses, the NeMo Curator SDG toolset can also be used out of the box to verify or identify whether particular entities are related to Math or Python.
NOTE: This process can be extended to a more diverse set of topics by modifying the prompt templates.
Classify Math Entity #
The prompt used for this task is as follows:
'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".'
classify_math_entity
: This will classify if an entity is related to math or not.
response = await generator.classify_math_entity(
entity="What is the formula for the area of a circle?",
model=model
)
response
[['Yes, the concept "What is the formula for the area of a circle?" belongs to the first category: Math concepts taught at elementary school, middle school, high school, and university. Specifically, the formula for the area of a circle, which is A = πr², is typically taught in middle school or early high school.\n']]
response = await generator.classify_math_entity(
entity="Pizza Pie is so delicious.",
model=model
)
response
[['No, the concept "Pizza Pi is so delicious" does not belong to any of the listed categories, as it is not a mathematical concept, axiom, theorem, algorithm, equation, inequality, problem, function, or application. It appears to be a subjective statement about the taste of a food item named "Pizza Pi."\n']]
Classify Python Entity #
The prompt used for this task is as follows:
'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".'
classify_python_entity
: This will classify if an entity is related to Python or not.
response = await generator.classify_python_entity(
entity="How do I write a for loop in Python?",
model=model
)
response
[['Yes, the concept "How do I write a for loop in Python?" belongs to the category of "Programming concepts like loops, functions, and data structures in python." For loops are a fundamental control flow statement in Python used for iterating over a sequence (such as a list, tuple, or string) or other iterable objects.\n']]
response = await generator.classify_python_entity(
entity="Pythons are large snakes.",
model=model
)
response
[['No, the concept "Pythons are large snakes." does not belong to any of the mentioned categories. It is a statement about the biological python species and not related to programming, mathematical concepts, or computer science algorithms in the context of the Python programming language.\n']]