I have a task that pushes me to my limit.
I hope that you can help me a little further :-)
Any help and feedback to the path is appreciated
Situation:
I get answers to questions that are half single choice questions. The other half are numerical questions.
In this case:
Questions 1, 3 and 5: single choice question with
4 possible answer options: e.g. 1a / 1b / 1c / 1d
Question 2 and 4:
numerical questions.
Answer options: e.g. (any value)
I have very limited influence on the form/strucure of the answer. As a result, there are different variants of the answer.
Here is an example:
I get the following data as a CSV
Code: Select all
<bound method NDFrame.head of Name Value Date_Time
0 User 1 5c 2020-11-08T23:38:23Z
1 User 1 1000 2020-11-08T23:34:23Z
2 User 1 3B 2020-11-08T23:32:23Z
3 User 1 90 2020-11-08T23:30:23Z
4 User 1 1B 2020-11-08T23:28:23Z
5 User 2 150kg 2020-11-08T22:34:23Z
6 User 2 3B 2020-11-08T22:32:23Z
7 User 2 $80 USD 2020-11-08T22:30:23Z
8 User 3 5D 2020-11-08T21:34:23Z
9 User 3 300 2020-11-08T21:32:23Z
10 User 3 80 $ 2020-11-08T21:30:23Z
11 User 3 1B 2020-11-08T21:28:23Z
12 User 4 1: A 2020-11-08T20:36:23Z
13 User 4 2 - 17 2020-11-08T20:34:23Z
14 User 4 3:A 2020-11-08T20:32:23Z
15 User 4 4 - 326 2020-11-08T20:30:23Z
16 User 4 5 - 25000 2020-11-08T20:28:23Z>
With pivot I can group the data to one string for each user.
Code: Select all
df=comments[['Name', 'Value']]
table = pd.pivot_table(df, index='Name', values='Value', aggfunc=lambda x: ' '.join(x))
print(table)
Code: Select all
<bound method NDFrame.head of Name,Value
0 User 1,5c 1000 3B 90 1B
1 User 2,150kg 3B $80 USD
2 User 3,5D 300 80 $ 1B
3 User 4,1: A 2 - 17 3:A 4 - 326 5 - 25000>
The first goal is to create a target table in which each substring is correctly assigned to the name and the answer no. / (each column)
I can't get any further at this point. Considerations with dictionary or stringio did not lead to a solution.
I hope you have some ideas for me :-)
This is how the table should look like
Target Table 1 (Substring to answer no. in column)
Code: Select all
Name Answer 1 Answer 2 Answer 3 Answer 4 Answer 5
User 1 1B 90 3B 1000 5c
User 2 NaN $80 USD 3A 150kg NaN
User 3 1B 80 $ NaN 300 5D
User 4 1: A 2 - 17 3:A 4 - 326 5:B
I think the following criteria can be helpful:
- - the number in front of a letter indicates the answer no (answer 1 3 5)
- the letters are always a/b/c/d
- the numerical answers are between answer 1 & 3 / 3 & 5 (and its criteria)
- attention: no value can also be in the table (User 2 - answer 1 & 5 / User 3 - answer 3)
I guess it's difficult to meet all requirements.
But If we find a 80% solution for this problem I will be very happy. :-)
That would help me to get to the next step:
The second goal is to modify the substrings in the target table in such a way that you only have unique standard values without special characters or anything else.
- Letters only for single choice questions.
Numeric values (without $ or kg etc.) for numeric questions
Target Table 2 (modified to unique standard values)
Code: Select all
Name Answer 1 Answer 2 Answer 3 Answer 4 Answer 5
User 1 B 90 B 1000 C
User 2 0 80 A 150 0
User 3 B 80 0 300 D
User 4 A 17 A 326 B