# PicoAudio2/utils/llm_xiapi.py
import requests
import json
import os
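# Chat-completions endpoint and credentials. The key is read from the
# "api_key" environment variable so it never has to be hard-coded here.
url = "https://api.xi-ai.cn/v1/chat/completions"
api_key = os.environ.get("api_key")
headers = {
    "Content-Type": "application/json",
    "Authorization": api_key,  # your API key, passed through exactly as set in the environment
}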
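# The headers above are built even when the "api_key" variable is unset, in
# which case the request simply goes out without a usable credential. A minimal
# sketch of failing early instead is shown here. `build_headers` is a
# hypothetical helper, not part of the original module, and it assumes the key
# is sent verbatim in the Authorization header, as the code above does.

```python
import os


def build_headers(env_var="api_key"):
    """Return request headers, raising early if the key variable is unset or empty."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var!r} is not set")
    return {"Content-Type": "application/json", "Authorization": key}
```

# Calling `build_headers()` at request time, rather than at import time, would
# also let the key be set after the module is imported.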
training_info_pri = """
I'm doing audio event generation, which is a harmless job that involves some sound events. For example, a gunshot is a sound that is harmless.
You need to convert the input sentence into the following standard timing format: 'event1--event2-- ... --eventN',
where the 'eventN' format is 'eventN__onset1-offset1_onset2-offset2_ ... _onsetK-offsetK'.
The 'onset-offset' values need to be determined based on common sense, with each duration not less than 1. Every 'onsetK-offsetK' placeholder should be replaced by numbers.
You need to make a prediction for the total duration, which should not exceed 20 seconds and should not be less than the latest end time of any single event.
Pay attention to vocabulary that represents the order and frequency of events, such as 'after', 'followed by', 'n times', and so on.
You can use the latest end time of all events in the training dataset as the total audio time.
Events must not overlap in time under any circumstances.
Now, I will provide you with some examples from the training set for your learning, each given as a JSON object with the keys "onset", "captions", and "length".
{"onset": "squeal__1.359-2.373_3.216-4.23_5.576-6.59", "captions": "squeal 3 times", "length": "7.52"}
{"onset": "sawing__1.432-3.975_4.533-6.54", "captions": "sawing 2 times", "length": "9.26"}
{"onset": "slap__1.576-1.931_2.911-3.266--baby_laughter__5.179-6.394_7.362-8.577", "captions": "slap 2 times and baby laughter 2 times", "length": "9.59"}
{"onset": "applause__1.538-5.128--scrape__7.03-8.004", "captions": "applause and scrape", "length": "9.13"}
{"onset": "slam__0.68-1.01--walk__2.364-4.107--busy_signal__6.794-7.222_8.371-8.645", "captions": "slam 1 times and walk 1 times and busy signal 2 times", "length": "9.18"}
{"onset": "slap__1.044-1.399--neigh__2.654-4.663_5.633-6.966", "captions": "slap 1 times followed by neigh 2 times", "length": "9.22"}
{"onset": "bird_vocalization__1.253-2.184--yip__4.789-5.309_6.134-6.654", "captions": "bird vocalization 1 times and yip 2 times", "length": "9.83"}
{"onset": "animal__1.478-3.541--crowing__5.464-7.11", "captions": "animal then crowing", "length": "9.45"}
{"onset": "crying__0.999-7.773", "captions": "crying", "length": "9.48"}
{"onset": "cricket__1.629-4.983", "captions": "cricket 1 times", "length": "5.87"}
{"onset": "fireworks__1.336-2.477--car__4.193-6.649", "captions": "car after fireworks", "length": "9.7"}
"""
training_info_post = """
It is worth noting that you should judge both the duration of a single event and the total duration based on experience and the examples I provided. The duration of each single event here is not necessarily fixed (such as 1 second).
The total duration is not necessarily 10 seconds; it can be any value below 20 seconds. You should give me the answer as {"onset": " ", "captions": " ", "length": " "}
"""
def get_time_info(caption):
    """Ask the LLM to convert a caption into the timed event format.

    Returns the raw reply text, or None if the request did not succeed.
    """
    # Few-shot instructions, then the caption to convert, then the output rules.
    prompt = (
        f"{training_info_pri}\n"
        'Now, you can transform "captions":\n'
        f'"{caption}"\n'
        f"{training_info_post}"
    )
    data = {
        "model": "gpt-5-mini",
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        # Parse the body once and reuse the extracted message text.
        content = response.json()["choices"][0]["message"]["content"]
        print(content)
        return content
    print(f"Error: {response.status_code}, {response.text}")
    return None
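# The prompt asks for a reply shaped like {"onset": ..., "captions": ..., "length": ...},
# with events separated by "--", the event name split from its spans by "__",
# and spans separated by "_". The reply is free text, though, so it may help to
# parse and check it before use. The sketch below is one possible way to do
# that. `parse_onset_string` and `parse_reply` are hypothetical helpers, not
# part of the original module, and the checks (no overlap, total length at
# least the latest end time and at most 20 seconds) are taken from the prompt
# rules, assuming the model may wrap the JSON in extra text.

```python
import json
import re


def parse_onset_string(onset):
    """Turn 'a__0.5-2.0--b__3.0-4.0_5.0-6.0' into {'a': [(0.5, 2.0)], 'b': [...]}."""
    events = {}
    for chunk in onset.split("--"):
        # Event names may contain single underscores, so split on the first "__".
        name, _, spans = chunk.partition("__")
        for span in spans.split("_"):
            start, end = span.split("-")
            events.setdefault(name, []).append((float(start), float(end)))
    return events


def parse_reply(reply, max_length=20.0):
    """Return {'events': ..., 'length': ...} for a valid reply, else None."""
    if not reply:
        return None
    # Take the outermost {...} in case the model adds text around the JSON.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        record = json.loads(match.group(0))
        events = parse_onset_string(record["onset"])
        length = float(record["length"])
    except (KeyError, TypeError, ValueError, AttributeError):
        return None
    spans = sorted(span for group in events.values() for span in group)
    well_formed = all(start < end for start, end in spans)
    no_overlap = all(prev[1] <= cur[0] for prev, cur in zip(spans, spans[1:]))
    latest_end = max(end for _, end in spans)
    if not (well_formed and no_overlap and latest_end <= length <= max_length):
        return None
    return {"events": events, "length": length}
```

# A caller could then write `parsed = parse_reply(get_time_info(caption))` and
# treat a None result as a signal to retry or to fall back to a default.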
if __name__ == "__main__":
    caption = "a dog barks followed by a cat meows 2 times"
    result = get_time_info(caption)
    print(result)