파이썬 실전 데이터 분석

Updated: September 13, 2020

트럼프 대통령 트윗 분석하기

실습자료

첫번째 프로젝트에서는 트럼프 대통령이 2017년 1월 20일 취임 이후 1년 동안 게시한 2,500여 개의 트윗을 분석해봅니다.

가장 많이 사용한 #해시태그
가장 많이 사용한 키워드
가장 많이 사용한 @멘션
월별 트윗 통계

분석 후, 데이터의 유형에 알맞은 시각화 코드를 살펴봅니다.

막대 그래프
단어 구름

코드를 작성하기 전에 tweets.py 파일과 main.py의 스켈레톤 코드를 살펴보세요.

성해야 하는 함수
- preprocess_text(text)
- analyze_text(words)
- filter_by_month(tweet_data, month)

세부 구현 사항

preprocess_text(text)

문자열 text를 가공하여 반환합니다.

모든 알파벳 대문자를 알파벳 소문자로 변경합니다.
특수문자를 삭제합니다.
가공 된 텍스트를 공백을 기준으로 나누어 리스트 형태로 반환합니다.
호출 예시

preprocess_text("On my way! #Inauguration2017 https://t.co/hOuMbxGnpe")

반환 예시

['on', 'my', 'way', '#inauguration2017', 'httpstcohoumbxgnpe']

analyze_text(words) 문자열을 담고 있는 words 리스트가 주어집니다.

각각의 원소는 모두 keywords리스트에 저장하되, @, # 문자는 제외하고 저장해야 합니다. (예 : #tweet는 tweet의 값으로 저장한다.)
’#’ 문자로 시작하는 원소는 hashtags 리스트에, @문자로 시작하는 원소는 mentions 리스트에 각각 #와 @을 제거하고 저장합니다.
함수는 keywords, hashtags, mentions 를 반환해야 합니다.
반환 결과에서 첫 번째 리스트는 모든 키워드, 두 번째 리스트는 해쉬태그 키워드, 세 번째 리스트는 멘션 키워드를 갖고 있습니다.
호출 예시

analyze_text(['on', 'my', 'way', '#inauguration2017', 'httpstcohoumbxgnpe'])

반환 예시

(['on', 'my', 'way', 'inauguration2017', 'httpstcohoumbxgnpe'], ['inauguration2017'], [])

filter_by_month(tweet_data, month)

트윗 데이터와 트윗이 작성된 월(정수)을 입력 받아 해당 월에 게시된 트윗을 리스트에 저장한 후, 반환합니다.

호출 예시

filter_by_month(tweet_data, month)에서

tweet_data = [('01-19-2017 20:13:57', 'On my way! #Inauguration2017 https://t.co/hOuMbxGnpe'), 
('02-01-2017 00:31:08', 'Getting ready to deliver a VERY IMPORTANT DECISION!  8:00 P.M.'), 
('03-03-2017 02:27:29', '...intentional. This whole narrative is a way of saving face for 
Democrats losing an election that everyone thought they were supposed.....'), 
('03-03-2017 02:35:33', '...to win. The Democrats are overplaying their hand. They lost the election and now 
they have lost their grip on reality. The real story...')]
month = 3

인 상태로 호출한 결과는 다음과 같습니다.

반환 예시

['...intentional. This whole narrative is a way of saving face for Democrats losing an 
election that everyone thought they were supposed.....', '...to win. The Democrats 
are overplaying their hand. They lost the election and now they have lost their grip on reality. The real story...']

전체 트윗 데이터 중 3월에 작성한 트윗의 내용을 리스트에 담아 반환했습니다.

# 트럼프 대통령의 트윗 모음을 불러옵니다.
from tweets import trump_tweets

# 그래프에 필요한 라이브러리를 불러옵니다. 
import matplotlib.pyplot as plt

# 단어구름에 필요한 라이브러리를 불러옵니다. 
import numpy as np
from PIL import Image
from wordcloud import WordCloud

# 특화된 컨테이너 모듈에서 수 세기를 돕는 메소드를 불러옵니다.
from collections import Counter

# 문자열 모듈에서 특수문자를 처리를 돕는 메소드를 불러옵니다. 
from string import punctuation

# 엘리스에서 파일 송출에 필요한 패키지를 불러옵니다. 
from elice_utils import EliceUtils
elice_utils = EliceUtils()

from stopwords import stopwords


# 데이터 전처리를 실행합니다. 
def preprocess_text(text):
    # 분석을 위해 text를 모두 소문자로 변환합니다.
    text = text.lower()
    # @와 #을 제외한 특수문자로 이루어진 문자열 symbols를 만듭니다.
    symbols = punctuation.replace('@', '').replace('#', '')
    
    for i in symbols:
        text = text.replace(i,'')
    
    arr = text.split()
    
    return arr
    

# 해시태그와 키워드를 추출합니다. 
def analyze_text(words):
    # 키워드, 해시태그, 멘션을 저장할 리스트를 각각 생성합니다.
    keywords, hashtags, mentions = [], [], []
    
    for i in words:
        if i.startswith('#'):
            hashtags.append(i.replace('#',''))
            keywords.append(i.replace('#',''))
        elif i.startswith('@'):
            mentions.append(i.replace('@',''))
            keywords.append(i.replace('@',''))
        else:
            keywords.append(i)

    print(keywords, hashtags, mentions)
    
    return keywords, hashtags, mentions


def filter_by_month(tweet_data, month):
        
    month_string = '0' + str(month) if month < 10 else str(month)
    
    # 선택한 달의 트윗을 filtered_tweets에 저장합니다.
    filtered_tweets = []
    
    # 트윗의 날짜가 선택한 달에 속해 있으면 트윗의 내용을 filtered_tweets에 추가합니다.
    #print(tweet_data[0])
    for tweet in tweet_data:
        if tweet[0].startswith(month_string):
            filtered_tweets.append(tweet[1])
    return filtered_tweets

# 트윗 통계를 출력합니다.
def show_stats():
    keyword_counter = Counter()
    hashtag_counter = Counter()
    mention_counter = Counter()
    
    for _, tweet in trump_tweets:
        keyward, hashtag, mention = analyze_text(preprocess_text(tweet))
        keyword_counter += Counter(keyward)
        hashtag_counter += Counter(hashtag)
        mention_counter += Counter(mention)
    
    # 가장 많이 등장한 키워드, 해시태그, 멘션을 출력합니다.
    top_ten = hashtag_counter.most_common(10)
    for hashtag, freq in top_ten:
        print('{}: {}회'.format(hashtag, freq))


# 월 별 트윗 개수를 보여주는 그래프를 출력합니다. 
def show_tweets_by_month():
    months = range(1, 13)
    num_tweets = [len(filter_by_month(trump_tweets, month)) for month in months]
    
    plt.bar(months, num_tweets, align='center')
    plt.xticks(months, months)
    
    plt.savefig('graph.png')
    elice_utils = EliceUtils()
    elice_utils.send_image('graph.png')


# wordcloud 패키지를 이용해 트럼프 대통령 실루엣 모양의 단어구름을 생성합니다.
def create_word_cloud():
    
    counter = Counter()
    for _, tweet in trump_tweets:
        keywords, _, _ = analyze_text(preprocess_text(tweet))
        counter += Counter(keywords)
    
    trump_mask = np.array(Image.open('trump.png'))
    cloud = WordCloud(background_color='white', mask=trump_mask)
    cloud.fit_words(counter)
    cloud.to_file('cloud.png')
    elice_utils.send_image('cloud.png')


# 입력값에 따라 출력할 결과를 선택합니다. 
def main(code=1):
    # 가장 많이 등장한 키워드, 해시태그, 멘션을 출력합니다.
    if code == 1:
        show_stats()
    
    # 트럼프 대통령의 월별 트윗 개수 그래프를 출력합니다.
    if code == 2:
        show_tweets_by_month()
    
    # 트럼프 대통령의 트윗 키워드로 단어구름을 그립니다.
    if code == 3:
        create_word_cloud()


# main 함수를 실행합니다. 
if __name__ == '__main__':
    main(1)

영어 단어 모음 분석하기

실습자료

이 프로젝트에서는 영어 단어와 그 빈도수를 정리한 British National Corpus 단어 모음을 분석하고 시각화해봅니다.

corpus.txt를 이용해 가장 많이 사용된 영어 단어 분석
matplotlib을 이용해 단어 별 사용 빈도를 보여주는 막대 그래프 작성

분석 후《이상한 나라의 엘리스》동화책에 등장하는 단어 수와 BNC 데이터를 비교해보겠습니다.

가장 많이 등장하는 단어의 분포
불용어를 제외하고 가장 많이 사용된 단어

라이브 수업에서 함께 코드를 작성하기 전에 corpus.txt 파일과 main.py의 스켈레톤 코드를 살펴보세요.

작성해야 하는 함수
- import_corpus(filename)
- create_corpus(filenames)
- filter_by_prefix(corpus, prefix)
- most_frequent_words(corpus, number)
세부 구현 사항
1. import_corpus(filename)

단어와 빈도수 데이터가 담긴 파일 한 개를 불러온 후, (단어, 빈도수) 꼴의 튜플로 구성된 리스트를 반환합니다. 즉, 코퍼스 파일을 읽어 리스트로 변환하는 함수입니다.

반환 예시

[('zoo', 768), ('zones', 1168), ...

create_corpus(filenames)

텍스트 파일 여러 개를 한 번에 불러온 후, (단어, 빈도수) 꼴의 튜플로 구성된 리스트를 반환합니다. 즉, 텍스트 파일을 읽어들여 튜플꼴의 리스트 형태로 만드는 함수입니다.

반환 예시

[('Down', 3), ('the', 487), ('RabbitHole', 1), ...

filter_by_prefix(corpus, prefix)

(단어, 빈도수) 꼴의 튜플들을 담고 있는 리스트의 형태로 주어지는 corpus의 데이터 중 특정 문자열 prefix로 시작하는 단어 데이터만 추린 리스트를 반환합니다.

filter_by_prefix(corpus, "qu")

주어진 corpus 데이터 중에서 문자열 “qu”로 시작하는 데이터만 추려 반환합니다.

반환 예시

[('quotes', 700), ('quoted', 2663), ('quote', 1493),  ...

most_frequent_words(corpus, number)

corpus의 데이터 중 가장 빈도가 높은 number개의 데이터만 추립니다.

호출 예시

most_frequent_words(corpus, 3)

반환 예시

[('the', 6187927), ('of', 2941790), ('and', 2682878)]

# 프로젝트에 필요한 패키지를 import합니다.
from operator import itemgetter
from collections import Counter
from string import punctuation
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

from elice_utils import EliceUtils
elice_utils = EliceUtils()


def import_corpus(filename):
    # 튜플을 저장할 리스트를 생성합니다.
    corpus = []
    
    
    # 매개변수로 입력 받은 파일을 열고 읽습니다.
    with open(filename) as file:
        for line in file:
            word, num = line.split(',')
            num = int(num.replace('\n',''))
            corpus.append((word,num))
        # 텍스트 파일의 각 줄을 (단어, 빈도수) 꼴로 corpus에 저장합니다.
    
    return corpus


def create_corpus(filenames):
    # 단어를 저장할 리스트를 생성합니다.
    words = []
    
    # 여러 파일에 등장하는 모든 단어를 모두 words에 저장합니다.
            # 이 때 문장부호를 포함한 모든 특수기호를 제거합니다. 4번째 줄에서 임포트한 punctuation을  이용하세요.
    print(len(filenames))
    
    for txt in filenames:
        with open(txt) as file:
            content = file.read()
            for symbol in punctuation:
                content = content.replace(symbol,'')
            words = words + content.split()
    
    # words 리스트의 데이터를 corpus 형태로 변환합니다. Counter() 사용 방법을 검색해보세요.
    print(words)
    
    corpus = Counter(words)
    return list(corpus.items())


def filter_by_prefix(corpus, prefix):
    
    tmp = list(filter(lambda word : word[0].startswith(prefix), corpus))
    
    return tmp


def most_frequent_words(corpus, number):
    
    tmp = sorted(corpus, key = lambda corpus : corpus[1], reverse = True)[:number]
    
    return tmp
    

def draw_frequency_graph(corpus):
    # 막대 그래프의 막대 위치를 결정하는 pos를 선언합니다.
    pos = range(len(corpus))
    
    # 튜플의 리스트인 corpus를 단어의 리스트 words와 빈도의 리스트 freqs로 분리합니다.
    words = [tup[0] for tup in corpus]
    freqs = [tup[1] for tup in corpus]
    
    # 한국어를 보기 좋게 표시할 수 있도록 폰트를 설정합니다.
    font = fm.FontProperties(fname='./NanumBarunGothic.ttf')
    
    # 막대의 높이가 빈도의 값이 되도록 설정합니다.
    plt.bar(pos, freqs, align='center')
    
    # 각 막대에 해당되는 단어를 입력합니다.
    plt.xticks(pos, words, rotation='vertical', fontproperties=font)
    
    # 그래프의 제목을 설정합니다.
    plt.title('단어 별 사용 빈도', fontproperties=font)
    
    # Y축에 설명을 추가합니다.
    plt.ylabel('빈도', fontproperties=font)
    
    # 단어가 잘리지 않도록 여백을 조정합니다.
    plt.tight_layout()
    
    # 그래프를 표시합니다.
    plt.savefig('graph.png')
    elice_utils.send_image('graph.png')


def main(prefix=''):
    # import_corpus() 함수를 통해 튜플의 리스트를 생성합니다.
    corpus = import_corpus('corpus.txt')
    # head로 시작하는 단어들만 골라 냅니다.
    prefix_words = filter_by_prefix(corpus, prefix)
    
    # 주어진 prefix로 시작하는 단어들을 빈도가 높은 순으로 정렬한 뒤 앞의 10개만 추립니다.
    top_ten = most_frequent_words(prefix_words, 10)
    
    # 단어 별 빈도수를 그래프로 나타냅니다.
    draw_frequency_graph(top_ten)
    
    # 'Alice in Wonderland' 책의 단어를 corpus로 바꿉니다.
    alice_files = ['alice/chapter{}.txt'.format(chapter) for chapter in range(1, 6)]
    alice_corpus = create_corpus(alice_files)
    
    top_ten_alice = most_frequent_words(alice_corpus, 10)
    draw_frequency_graph(top_ten_alice)


if __name__ == '__main__':
    main()

넷플릭스 시청 데이터 분석하기

실습자료

회원 별로 시청한 작품 정리하기
두 작품의 유사도 비교하기
예상 선호도 점수 구하기
작성해야 하는 함수
- preprocess_data()
- reformat_data()
- get_closeness()
- predict_preference()
세부 구현 사항

preprocess_data(filename)

입력 받은 JSON 형식의 데이터를 딕셔너리 형태로 변환하여 반환합니다. 이때 int()를 이용해 key를 정수로 설정합니다.

JSON 데이터의 key는 영화 id, value는 유저 id가 담긴 리스트로 구성되어있습니다.

반환 예시

{영화 id : [유저 id1, 유저 id2, ...],  ...}

각 영화 id에 대해 해당 영화를 시청한 유저들의 id를 담고 있는 리스트가 매칭된 딕셔너리를 반환합니다.

reformat_data(title_to_users)

작품 별 시청한 사용자 데이터인 title_to_users가 주어집니다. 이 데이터를 사용자 별 시청 작품이 각각 key와 value로 담긴 딕셔너리로 반환합니다.

반환 예시

{유저 id : [영화 id1, 영화 id2, ...],  ...}

get_closeness(title_to_users, title1, title2)

두 작품의 유사도를 구합니다. 이때 유사도는 (두 작품을 모두 본 사용자) / (두 작품 중 하나라도 본 사용자) 와 같은 형태로 구합니다.

predict_preference(title_to_users, user_to_titles, user, title)

작품1과 사용자A가 주어졌을 때, 예상 선호도를 계산합니다. 작품1에 대한 사용자A의 예상 선호도는 사용자A가 시청한 모든 작품과 작품1 유사도의 평균값입니다.

예를 들어, 사용자A가 시청한 3개의 작품과 작품1의 유사도가 0.6, 0.4, 0.5일 때, 선호도 점수는 0.5입니다.

import matplotlib.pyplot as plt
import json
from operator import itemgetter

from elice_utils import EliceUtils
from movies import titles


def preprocess_data(filename):
    processed = {}
    with open(filename) as file:
        # 입력 받은 JSON 파일을 불러와 loaded에 저장합니다.
        loaded = file.read()
        # JSON 형식의 데이터에서 영화와 사용자 정보를 하나씩 가져옵니다.
        tmp = json.loads(loaded)
    
        processed = {int(k): v for k,v in tmp.items()}
        
            # processed 딕셔너리에 title을 키로, user를 값으로 저장합니다.
        return processed

def reformat_data(title_to_users):
    user_to_titles = {}
    # 입력받은 딕셔너리에서 영화와 사용자 정보를 하나씩 가져옵니다.
    for title, users  in title_to_users.items():
        # user_to_titles에 사용자 정보가 있을 경우 사용자의 영화 정보를 추가합니다. 이때 영화 정보는 리스트형으로 저장됩니다.  
        for user in users:
            if user_to_titles.get(user):
                user_to_titles[user].append(title)
            else:
                user_to_titles[user] = [title]
                
            # user_to_titles에 사용자 정보가 있을 경우 사용자 정보와 영화 정보를 추가합니다. 이때 영화 정보는 리스트형으로 저장됩니다. 
    return user_to_titles


def get_closeness(title_to_users, title1, title2):
    # title_to_users를 이용해 title1를 시청한 사용자의 집합을 저장합니다.
    title1_users = set(title_to_users[title1])
    # title_to_users를 이용해 title2를 시청한 사용자의 집합을 저장합니다.
    title2_users = set(title_to_users[title2])
    
    # 두 작품을 모두 본 사용자를 구합니다.
    both = len(title1_users & title2_users)
    # 두 작품 중 하나라도 본 사용자를 구합니다.
    either = len(title1_users | title2_users)
    
    return both/either


def predict_preference(title_to_users, user_to_titles, user, title):
    # user_to_titles를 이용해 user가 시청한 영화를 저장합니다.
    titles = user_to_titles[user]
    # get_closeness() 함수를 이용해 유사도를 계산합니다.
    closeness = []
    
    for my_title in titles:
        closeness.append(get_closeness(title_to_users,my_title,title))

    return sum(closeness) / len(closeness)

def main():
    filename = 'netflix.json'
    title_to_users = preprocess_data(filename)
    user_to_titles = reformat_data(title_to_users)
    
    lotr1 = 2452                # 반지의 제왕 - 반지 원정대
    lotr2 = 11521               # 반지의 제왕 - 두 개의 탑
    lotr3 = 14240               # 반지의 제왕 - 왕의 귀환
    
    killbill1 = 14454           # 킬 빌 - 1부
    killbill2 = 457             # 킬 빌 - 2부
    
    jurassic_park = 14312       # 쥬라기 공원
    shawshank = 14550           # 쇼생크 탈출
    
    print("[유사도 측정]")
    #값을 바꿔가며 실행해보세요.
    title1 = lotr1
    title2 = killbill1
    description = "{}와 {}의 작품 성향 유사도".format(titles[title1], titles[title2])
    closeness = round(get_closeness(title_to_users, title1, title2) * 100)
    print("{}: {}%".format(description, closeness))
    
    username = 'elice'
    new_utt = user_to_titles.copy()
    new_utt[username] = [lotr1, lotr2, lotr3]
    
    print("[{} 사용자를 위한 작품 추천]".format(username))
    preferences = [(title, predict_preference(title_to_users, new_utt, 'elice', title)) for title in title_to_users]
    preferences.sort(key=itemgetter(1), reverse=True)
    for p in preferences[:10]:
        print("{} ({}%)".format(titles[p[0]], round(p[1] * 100)))


if __name__ == "__main__":
    main()

인기 있는 TED 강연 분석하기

실습자료

TED 강연의 관련 통계 자료를 분석하고, 인기 있는 강연은 어떤 요소들이 있는지 찾아내 보겠습니다.

데이터 형식

강의 데이터가 담긴 ted.csv 파일은 다음과 같은 내용으로 구성되어 있습니다. 1열부터 순서대로입니다.

댓글 개수
강연에 대한 부가 설명
강연 길이 (초 단위)
행사명 (예: TED2009)
녹화 일자
번역된 언어 수
연사 이름
연사 이름과 강연 제목
연사 수
강연 공개 일자
강연 평가 (JSON 형식, CSV 파일을 직접 참고하세요)
연관된 강연
연사 직업
태그 (관련 키워드)
강연 제목
강연 동영상 URL 주소
조회수

작성해야 하는 함수
- preprocess_talks(csv_file)
- get_popular_tags(talks, n)
- completion_rate_by_duration(talks)
- views_by_languages(talks)
- show_ratings()
세부 구현 사항

preprocess_talks(csv_file)

주어진 CSV 파일을 열어 처리 가능한 파이썬의 리스트 형태로 변환합니다. 리스트의 각 원소는 코드에서 설명된 딕셔너리 형태로 이루어져 있습니다.

get_popular_tags(talks, n)

가장 인기 있는 태그 상위 n개를 반환합니다. 태그의 인기도는 해당 태그를 포함하는 강의들의 조회수 합으로 결정됩니다. 예를 들어, ‘education’ 태그가 포함된 강의가 총 15개라면 ‘education’ 태그의 인기도는 이 15개 강의의 조회수 총합입니다.

호출 예시

get_popular_tags(talks, 10)

반환 예시

['culture', 'technology', 'science', 'business', 'TEDx', 'global issues', 'entertainment', 'design', 'psychology', 'brain']

completion_rate_by_duration(talks)

강연의 길이에 따라 강의를 끝까지 들은 비율(완수도)이 어떻게 변화하는지 확인합니다.

강의를 끝까지 들은 비율은 (댓글 개수 / 조회수)에 비례한다고 가정합니다. 완수도를 나타내는 completion_rates와 강의 길이를 나타내는 duration을 반환합니다.

views_by_languages(talks)

지원되는 언어의 수에 따른 조회수를 scatter plot으로 나타내기 위해 각각의 언어와 조회수를 세어 반환합니다.

조회수를 나타내는 views와 언어의 수를 나타내는 languages를 반환합니다.

호출 예시

views_by_languages(talks)

반환 예시

([언어 수 1, 언어 수 2, 언어 수 3... ], [조회수 1, 조회수 2, 조회수 3...])

show_ratings(talk)

강의에 대한 다양한 평가(rating)를 막대그래프로 표현하기 위해 각각의 평가 키워드와 횟수를 세어 반환합니다.

각 키워드를 나타내는 keywords와 횟수를 나타내는 counts를 반환합니다.

호출 예시

show_ratings(talks[0])

반환 예시

(['Funny', 'Beautiful', 'Ingenious', 'Courageous', 'Longwinded', 'Confusing', 'Informative',
'Fascinating', 'Unconvincing', 'Persuasive', 'Jaw-dropping', 'OK', 'Obnoxious', 'Inspiring'], 
[19645, 4573, 6073, 3253, 387, 242, 7346, 10581, 300, 10704, 4439, 1174, 209, 24924])

import csv
import json
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

from operator import itemgetter

from elice_utils import EliceUtils
elice_utils = EliceUtils()


def jsonify(data):
    return json.loads(data.replace("'", '"'))


def preprocess_talks(csv_file):
    # 강연 데이터를 저장할 빈 리스트를 선언합니다.
    talks = []
    
    # CSV 파일을 열고, 데이터를 읽어 와서 talks에 저장합니다.
    with open(csv_file) as file:
        reader = csv.reader(file, delimiter = ',')
        for row in reader:
            try:
                talk = {
                    'title': row[14],     # 강연의 제목
                    'speaker': row[6],   # 강연자의 이름
                    'views': int(row[16]),     # 조회수
                    'comments': int(row[0]),  # 댓글의 개수
                    'duration': int(row[2]),  # 강연 길이
                    'languages': int(row[5]), # 지원하는 언어의 수
                    'tags': jsonify(row[13]),      # 관련 태그 (키워드)
                    'ratings': jsonify(row[10]),   # 강의에 대한 평가
                }
            except:
                pass
            talks.append(talk)
    
    return talks


def get_popular_tags(talks, n):
    # 태그 별 인기도를 저장할 딕셔너리
    tag_to_views = {}
    
    # 태그 별 인기도를 구해 tag_to_views에 저장합니다.
    for talk in talks:
        for tag in talk['tags']:
            if tag_to_views.get(tag):
                tag_to_views[tag] += talk['views']
            else :
                tag_to_views[tag] = talk['views']
                
    
    # (태그, 인기도)의 리스트 형식으로 변환합니다.
    tag_view_pairs = list(tag_to_views.items())
    
    

    # 인기도가 높은 순서로 정렬해 앞의 n개를 취합니다.
    # n개를 취한 후에는 태그만 남깁니다.
    # 힌트: itemgetter()를 사용하세요!
    top_tag_and_views = sorted(tag_view_pairs, key=itemgetter(1), reverse=True)[:n]
    
    top_tags = map(lambda x : x[0] , top_tag_and_views)
    
    return list(top_tags)


def completion_rate_by_duration(talks):
    durations = []
    completion_rates = []
    for talk in talks:
        durations.append(talk['duration'])
        completion_rates.append(talk['comments']/talk['views'])

    scatter_plot(durations, completion_rates, '강의 길이', '완수도')
    
    return completion_rates, durations


def views_by_languages(talks):
    languages = []
    views = []
    
    for talk in talks:
        languages.append(talk['languages'])
        views.append(talk['views'])
    
    scatter_plot(languages, views, '언어의 수', '조회수')
    
    # 채점을 위해 결과를 리턴합니다.
    return views, languages


def show_ratings(talk):

    
    
        
    keywords = []
    counts = []
    for rating in talk['ratings']:
        keywords.append(rating['name'])
        counts.append(rating['count'])
    
    bar_plot(keywords, counts, '키워드', 'rating의 수')
    
    # 채점을 위해 결과를 리턴합니다.
    return keywords, counts


def scatter_plot(x, y, x_label, y_label):
    font = fm.FontProperties(fname='./NanumBarunGothic.ttf')
    
    plt.scatter(x, y)
    plt.xlabel(x_label, fontproperties=font)
    plt.ylabel(y_label, fontproperties=font)
    
    plt.xlim((min(x), max(x)))
    plt.ylim((min(y), max(y)))
    plt.tight_layout()
    
    plot_filename = 'plot.png'
    plt.savefig(plot_filename)
    elice_utils.send_image(plot_filename)


def bar_plot(x_ticks, y, x_label, y_label):
    assert(len(x_ticks) == len(y))
    
    font = fm.FontProperties(fname='./NanumBarunGothic.ttf')
    
    pos = range(len(y))
    plt.bar(pos, y, align='center')
    plt.xticks(pos, x_ticks, rotation='vertical', fontproperties=font)
    
    plt.xlabel(x_label, fontproperties=font)
    plt.ylabel(y_label, fontproperties=font)
    plt.tight_layout()
    
    plot_filename = 'plot.png'
    plt.savefig(plot_filename)
    elice_utils.send_image(plot_filename)


def main():
    src = 'ted.csv'
    talks = preprocess_talks(src)
    print(get_popular_tags(talks, 10))
    completion_rate_by_duration(talks)
    views_by_languages(talks)
    show_ratings(talks[0])


if __name__ == "__main__":
    main()

Share on

Twitter Facebook LinkedIn

Lim Junhyeong

파이썬 실전 데이터 분석

트럼프 대통령 트윗 분석하기

영어 단어 모음 분석하기

넷플릭스 시청 데이터 분석하기

인기 있는 TED 강연 분석하기

Share on

Leave a comment

You may also enjoy

mariaDB 환경설정 계정 생성 및 권한 부여

프로젝트에 필요한 Git

여러개의 원격 저장소(git)

벽 부수고 이동하기 4_16946