Here are my dictionaries:
dict_assembly = {'ind1gene1':'individual1', 'ind1gene2':'individual1','ind1gene3':'individual1', 'ind2gene1':'individual2', 'ind2gene2':'individual2','ind2gene3':'individual2', 'ind3gene1':'individual3', 'ind3gene2':'individual3','ind3gene3':'individual3','ind4gene1':'individual4','ind4gene2':'individual4','ind4gene3':'individual4','ind4gene4':'individual4'}
dict_bhit = {'ind1gene1':'AAAAA', 'ind1gene2':'BBBBB','ind1gene3':'CCCCC', 'ind2gene1':'AAAAA', 'ind2gene2':'BBBBB','ind2gene3':'BBBBB', 'ind3gene1':'AAAAA', 'ind3gene2':'BBBBB','ind3gene3':'CCCCC','ind4gene1':'AAAAA','ind4gene2':'BBBBB','ind4gene3':'CCCCC','ind4gene4':'DDDDD'}
dict_identity = {'ind1gene1':'98','ind2gene1':'96','ind3gene1':'95','ind4gene1':'96','indi5gene1':'94','ind1gene2':'67','ind2gene2':'76','ind3gene2':'80','ind4gene2':'77','ind5gene2':'76','ind1gene3':'98','ind2gene3':'97','ind3gene3':'96','ind4gene3':'96','ind4gene4':'40'}
data = {} # temporary dictionary
The code for this example is split into two blocks.
First part:
import pandas as pd
import time
start = time.time()
matrix_file = open("concatenated.matrix", "w" )
col_subject = ['query', 'subject']
df_accession = pd.DataFrame(dict_bhit.items(), columns=col_subject)
col_genome = ['query', 'genome']
df_assembly = pd.DataFrame(dict_assembly.items(), columns=col_genome)
df_assembly['subject'] = df_assembly['query'].map(df_accession.set_index('query')['subject'])
matrix = pd.get_dummies(df_assembly.set_index('genome')['subject']).max(level=0).max(level=0, axis=1)
matrix.to_csv(matrix_file, sep='\t', header=True, index=True)
print matrix
end = time.time()
print 'This step spent',round(end - start, 4), 'seconds\n'
Second part:
start = time.time()
matrix_file = open("identity.matrix", "w" )
col_bhit = ['gene', 'subject']
df_bmatch = pd.DataFrame(dict_bhit.items(), columns=col_bhit) # convert "dict_bhit" into a dataframe
col_file = ['gene', 'assembly']
df_origin = pd.DataFrame(dict_assembly.items(), columns=col_file) # convert "dict_assembly" into a dataframe
col_percent = ['gene', 'percent']
df_percent = pd.DataFrame(dict_identity.items(), columns=col_percent) # convert "dict_identity" into a dataframe
for k, col in dict_assembly.items():
    if k in dict_bhit and k in dict_identity:
        # gene has both a best hit and an identity value
        data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]
    elif k in dict_bhit and k not in dict_identity:
        # gene has a best hit but no identity value
        data.setdefault(dict_bhit[k], {})[col] = "NA"
df = pd.DataFrame(data)
df.to_csv(matrix_file, sep='\t', header=True, index=True)
print df
end = time.time()
print 'This step spent',round(end - start, 4), 'seconds\n'
Any suggestions on how to reduce the processing time for building the second table? As you can see, the timings differ by about a factor of 2.
Saving presence/absence table ...
             AAAAA  BBBBB  CCCCC  DDDDD
genome
individual1      1      1      1      0
individual2      1      1      0      0
individual3      1      1      1      0
individual4      1      1      1      1
This step spent 0.0084 seconds
Saving identity table...
            AAAAA BBBBB CCCCC DDDDD
individual1    98    67    98   NaN
individual2    96    76   NaN   NaN
individual3    95    80    96   NaN
individual4    96    77    96    40
This step spent 0.0106 seconds
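A single wall-clock measurement of around 0.01 seconds is noisy, so it may help to repeat the measurement before optimizing anything. Note also that the timed region of the second part builds df_bmatch, df_origin and df_percent, none of which is used by the lines shown afterwards. The following is only a sketch, written in Python 3 syntax and using the standard-library timeit module; build_data is a hypothetical helper name and the dictionaries are reduced, illustrative copies of the ones above.

```python
import timeit

# Reduced copies of the dictionaries from the question (illustrative data only).
dict_assembly = {'ind1gene1': 'individual1', 'ind1gene2': 'individual1',
                 'ind2gene1': 'individual2', 'ind2gene2': 'individual2'}
dict_bhit = {'ind1gene1': 'AAAAA', 'ind1gene2': 'BBBBB',
             'ind2gene1': 'AAAAA', 'ind2gene2': 'BBBBB'}
dict_identity = {'ind1gene1': '98', 'ind1gene2': '67', 'ind2gene1': '96'}


def build_data():
    """Hypothetical helper: the nested-dict loop from the second part."""
    data = {}
    for k, col in dict_assembly.items():
        if k in dict_bhit and k in dict_identity:
            data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]
        elif k in dict_bhit and k not in dict_identity:
            data.setdefault(dict_bhit[k], {})[col] = "NA"
    return data


# Run the loop 1000 times per measurement, take five measurements,
# and keep the fastest one, which is least affected by background noise.
timings = timeit.repeat(build_data, number=1000, repeat=5)
best_per_call = min(timings) / 1000
print('Best time per call: %.6f seconds' % best_per_call)
```

The same pattern can wrap the DataFrame construction or the to_csv call separately, which should show where the time actually goes on the real data.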
To address this and save a few seconds on a large dataset, I commented out the two lines of the "elif" branch (Option 1).
Option 1:
for k, col in dict_assembly.items():
    if k in dict_bhit and k in dict_identity:
        data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]
    #elif k in dict_bhit and k not in dict_identity:
        #data.setdefault(dict_bhit[k], {})[col] = "NA"
df = pd.DataFrame(data)
df.to_csv(matrix_file, sep='\t', header=True, index=True)
print df
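Dropping the "elif" still yields a table with missing values because pandas fills any cell that has no entry in a dict of dicts with NaN. Below is a minimal sketch of that behaviour in Python 3 syntax, with a small illustrative dictionary standing in for data. It writes to an in-memory buffer instead of identity.matrix and uses the na_rep parameter of to_csv to get the "NA" text back in the output file.

```python
import io

import pandas as pd

# Nested dict as the Option 1 loop would leave it: no entry at all is
# stored for a gene that has no identity value (illustrative data only).
data = {
    'AAAAA': {'individual1': '98', 'individual2': '96'},
    'BBBBB': {'individual1': '67'},  # individual2 has no identity here
}

# Outer keys become columns, inner keys become the row index;
# cells with no entry are filled with NaN.
df = pd.DataFrame(data)

# na_rep controls how NaN is written; the default is an empty field.
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', header=True, index=True, na_rep='NA')
print(buffer.getvalue())
```

One difference to keep in mind: a subject whose genes all lack an identity value gets no column at all this way, whereas the original "elif" would have created a column filled with "NA".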
For a small dataset you can use Option 2, which removes the "if" condition entirely. Note that it raises a KeyError if any key of dict_assembly is missing from dict_bhit or dict_identity.
Option 2:
for k, col in dict_assembly.items():
    data.setdefault(dict_bhit[k], {})[col] = dict_identity[k]
df = pd.DataFrame(data)
df.to_csv(matrix_file, sep='\t', header=True, index=True)
print df
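If some genes may be missing from dict_bhit or dict_identity, a middle ground between the two options is to use dict.get, which keeps the loop body short without risking a KeyError. This is only a sketch in Python 3 syntax. build_data is a hypothetical helper name, the dictionaries are illustrative, and it assumes None is never stored as a best hit.

```python
def build_data(dict_assembly, dict_bhit, dict_identity):
    """Hypothetical helper: build {subject: {assembly: identity}}."""
    data = {}
    for k, col in dict_assembly.items():
        subject = dict_bhit.get(k)
        if subject is None:
            continue  # gene without a best hit: skipped, as in the question
        # Fall back to "NA" when the gene has no identity value.
        data.setdefault(subject, {})[col] = dict_identity.get(k, "NA")
    return data


# Illustrative data: ind2gene1 has no identity, ind3gene1 has no best hit.
assembly = {'ind1gene1': 'individual1', 'ind2gene1': 'individual2',
            'ind3gene1': 'individual3'}
bhit = {'ind1gene1': 'AAAAA', 'ind2gene1': 'AAAAA'}
identity = {'ind1gene1': '98'}

result = build_data(assembly, bhit, identity)
print(result)
```

The resulting dictionary can be passed to pd.DataFrame exactly as data is in the options above. Whether this is faster than the original loop on a large dataset is something to measure, not assume.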