Usage
Installation
To use Besimilarity, first install it using pip:
(.venv) $ pip install besimilarity
Quickstart
- besimilarity.AgglomerativeBestSimilarity
You can use the besimilarity.AgglomerativeBestSimilarity() class to find the similarity/distance equation that best fits your data, evaluated with agglomerative clustering. The class can also find the best linkage method for each similarity/distance equation. Make sure the target column is the last column of your DataFrame:
>> import pandas as pd
>> from besimilarity import AgglomerativeBestSimilarity
>> aggbs = AgglomerativeBestSimilarity()
>> df = pd.DataFrame(data={
       'A': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       'B': [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       'C': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       'D': [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       'E': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       'F': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
       'G': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       'H': [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       'I': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       'Target': [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]})
>> aggbs.fit(df, n_clusters=2)
df shape: (10, 10)
n_clusters: 2
affinity: all
linkage: all
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.06s/it]
>> result = aggbs.get_result()
>> print(result.head())
    linkage           equation  homogeneity  completeness  v_measure
0  complete          sim gower     0.628236      0.609987   0.618977
1  complete         sim stiles     0.573438      0.631777   0.601196
2   average         sim peirce     0.432538      0.432538   0.432538
3  complete  sim fager_mcgowan     0.331560      0.445928   0.380332
4   average  sim fager_mcgowan     0.331560      0.445928   0.380332

Alternatively, to score each equation on pairs of rows compared against their targets, use the besimilarity.PairBestSimilarity() class instead.
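For readers unfamiliar with the homogeneity, completeness, and v_measure columns in the AgglomerativeBestSimilarity result, the sketch below computes them from scratch for a target and a cluster assignment. It assumes the columns follow the standard entropy-based definitions (V-measure as the harmonic mean of homogeneity and completeness), which is consistent with the values printed in the example but is not confirmed by the package documentation. The helper names entropy, conditional_entropy, and v_measure_scores are hypothetical and not part of besimilarity.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b): uncertainty left in labels a once labels b are known."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * log(c / b_counts[bv]) for (_, bv), c in joint.items())


def v_measure_scores(target, clusters):
    """Return (homogeneity, completeness, v_measure) for two labelings."""
    h_target, h_clusters = entropy(target), entropy(clusters)
    # Homogeneity: each cluster contains members of a single target class.
    homogeneity = 1.0 if h_target == 0 else 1.0 - conditional_entropy(target, clusters) / h_target
    # Completeness: all members of a target class land in the same cluster.
    completeness = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(clusters, target) / h_clusters
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Target column from the example above.
target = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
# Hypothetical cluster labels: same grouping as the target, names swapped.
same_grouping = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
# Hypothetical degenerate result: every row in one cluster.
one_cluster = [0] * 10

print(v_measure_scores(target, same_grouping))
print(v_measure_scores(target, one_cluster))
```

A clustering that reproduces the target grouping scores 1.0 on all three measures regardless of how the clusters are numbered, while a single catch-all cluster is complete but not homogeneous, so its V-measure drops to zero.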
- besimilarity.PairBestSimilarity
This class pairs up every combination of rows and checks whether each pair has a high similarity value and whether the two targets match. If the targets are the same, the two rows should have a high similarity value:
>> import pandas as pd
>> from besimilarity import PairBestSimilarity
>> pbs = PairBestSimilarity()
>> df = pd.DataFrame(data={
       'A': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       'B': [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       'C': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       'D': [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       'E': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       'F': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
       'G': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       'H': [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       'I': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       'Target': [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]})
>> pbs.fit(df, use_seed=True, num_sample=2)
use_seed: True
num_sample: 2
1it [00:00, 96.36it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 21.34it/s]
final 5 best similarity:
           sim/dis           name  mean_auc
0          jaccard     similarity       1.0
1            gower     similarity       1.0
2        hellinger  dissimilarity       1.0
3  pearson_heron_1     similarity       1.0
4           dice_2     similarity       1.0
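The pair-based idea can be illustrated without the library. The sketch below scores every pair of rows with Jaccard similarity, labels each pair by whether its two targets match, and computes the AUC of the similarity as a predictor of "same target". This is only an assumption about what a pair-based AUC could look like: how PairBestSimilarity samples pairs (use_seed, num_sample) and averages into mean_auc is not shown in this page, and the jaccard and auc helpers and the toy rows are hypothetical.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two binary vectors: shared 1s over positions with any 1."""
    both = sum(1 for p, q in zip(x, y) if p and q)
    either = sum(1 for p, q in zip(x, y) if p or q)
    # Convention chosen here: two all-zero vectors get similarity 0.0.
    return both / either if either else 0.0


def auc(scores, labels):
    """Chance that a positive pair outscores a negative pair (ties count as 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the original example): (features, target) per row.
rows = [
    ([1, 1, 0, 0], 0),
    ([1, 0, 0, 0], 0),
    ([0, 0, 1, 1], 1),
    ([0, 0, 0, 1], 1),
]

scores, same_target = [], []
for (x, tx), (y, ty) in combinations(rows, 2):
    scores.append(jaccard(x, y))   # similarity of the pair
    same_target.append(tx == ty)   # True when both rows share a target

print(auc(scores, same_target))
```

An equation that ranks every same-target pair above every different-target pair reaches an AUC of 1.0, which is how several equations can tie at the top of the list on a small data set.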