
  • SageMaker: Enhancing Private LLM with QLoRA Technique (Fine-tuning with SageMaker HuggingFace DLC)

    SageMaker : Enhancing Private LLM with QLora Technique(Fine-tuning with SageMaker HuggingFace DLC) Written by Hyeonmin Kim Fine-tuning refers to the method of strengthening specific domains by utilizing datasets to train artificial intelligence models. Today, based on the architecture above, we will: Use SageMaker Jumpstrart Model(Mistral) with QLora technique Utilize SageMaker HuggingFace DLC(Deep Learning Containers) Fine-tune LLM using nlpai-lab/databricks-dolly-15k-ko dataset Finally enable Korean inference on the Mistral-7B model, which previously couldn't handle Korean The overall flow is as follows: After loading the dataset and creating a training dataset suitable for the Mistral model, we use SageMaker Training to run QLora technique scripts. At this time, we use SageMaker HuggingFace DLC to configure the ML learning environment as a container. After training is complete, we deploy to an Endpoint using HuggingFace Inference Container. Users only need to make requests to this Endpoint. Environment Setup Data Preparation Training Process Model Deployment Korean Inference Testing General Jumpstart Base Model Model with Korean Dataset Applied Environment Setup First, we need an environment to configure the ML pipeline. Refer to previous posts to set up the SageMaker Canvas environment and create a NoteBook for building ML pipelines. Once the environment is ready, install the necessary Python packages and huggingface-cli for using HuggingFace Hub. !pip install "transformers==4.34.0" "datasets[s3]==2.13.0" "sagemaker>=2.190.0" "gradio==3.50.2" "huggingface_hub[cli]" --upgrade --quiet After installation, you need to log into huggingface-cli. The required token value can be found at huggingface.co  under Profile - Edit Profile - Access Tokens. Once you have the token value, proceed with login: !huggingface-cli login --token hf_xxxxxxxxxxxxxxxxxxx Finally, specify the IAM Role and default bucket: import sagemaker import boto3 sess = sagemaker.Session() sagemaker_session_bucket = None if sagemaker_session_bucket is None and sess is not None : sagemaker_session_bucket = sess.default_bucket() try : role = sagemaker.get_execution_role() except ValueError: iam = boto3.client( 'iam' ) role = iam.get_role(RoleName = 'sagemaker_execution_role' )[ 'Role' ][ 'Arn' ] sess = sagemaker.Session(default_bucket = sagemaker_session_bucket) print ( f"sagemaker role arn: {role} " ) print ( f"sagemaker bucket: {sess.default_bucket()} " ) print ( f"sagemaker session region: {sess.boto_region_name} " ) Data Preparation First, we need to prepare a Korean training set. We'll use nlpai-lab/databricks-dolly-15k-ko, which is a Korean translation of the databricks-dolly dataset provided by Korea University research lab that developed the Kullm model. The databricks-dolly dataset is an open source created by Databricks, containing instructions including brainstorming, classification, private QA, generation, information extraction, public QA, and summarization. The total number of data points is 15,011. Since we installed the Datasets library, we can easily load the data: from datasets import load_dataset from random import randrange # Load dataset from the hub dataset = load_dataset( "nlpai-lab/databricks-dolly-15k-ko" , split= "train" ) print ( f"dataset size: { len (dataset)} " ) print (dataset[randrange( len (dataset))]) When the Mistral model learns data, it must follow a specific format to recognize the data. We need to change the dolly dataset format accordingly. 
The Mistral model distinguishes data with ### Instruction, ### Context (optional), ### Answer, and we need to change the format to match this. def format_dolly (sample): instruction = f"### Instruction\n {sample[ 'instruction' ]} " context = f"### Context\n {sample[ 'context' ]} " if len (sample[ "context" ]) > 0 else None response = f"### Answer\n {sample[ 'response' ]} " # join all the parts together prompt = "\n\n" .join([i for i in [instruction, context, response] if i is not None ]) return prompt print (format_dolly(dataset[randrange( len (dataset))])) Now we need to create and initialize a tokenizer. We'll use HuggingFace's AutoTokenizer to create an auto tokenizer suitable for the Mistral model: from transformers import AutoTokenizer model_id = "mistralai/Mistral-7B-v0.1" tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token =True ) Now, create a data preprocessing pipeline using the tokenizer created above. Apply templates to the dataset, tokenize, and then divide the dataset into chunks: from random import randint import sys sys.path.append( "../scripts/utils" ) from pack_dataset import pack_dataset def template_dataset (sample): sample[ "text" ] = f" {format_dolly(sample)}{tokenizer.eos_token} " return sample dataset = dataset .map (template_dataset, remove_columns = list (dataset.features)) # print random sample print (dataset[randint( 0 , len (dataset))][ "text" ]) # tokenize dataset dataset = dataset .map ( lambda sample: tokenizer(sample[ "text" ]), batched = True , remove_columns = list (dataset.features) ) # chunk dataset lm_dataset = pack_dataset(dataset, chunk_length = 2048 ) # Print total number of samples print ( f"Total number of samples: { len (lm_dataset)} " ) The source code for the used util pack_dataset is as follows: from itertools import chain from functools import partial remainder = { "input_ids" : [], "attention_mask" : [], "token_type_ids" : []} # empty list to save remainder from batches to use in next batch def pack_dataset (dataset, chunk_length = 2048 ): print ( f"Chunking dataset into chunks of {chunk_length} tokens." ) def chunk (sample, chunk_length = chunk_length): # define global remainder variable to save remainder from batches to use in next batch global remainder # Concatenate all texts and add remainder from previous batch concatenated_examples = {k: list (chain(*sample[k])) for k in sample.keys()} concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()} # get total number of tokens for batch batch_total_length = len(concatenated_examples[ list (sample.keys())[ 0 ]]) # get max number of chunks for batch if batch_total_length >= chunk_length: batch_chunk_length = (batch_total_length // chunk_length) * chunk_length # Split by chunks of max_len. 
result = { k: [t[i : i + chunk_length] for i in range ( 0 , batch_chunk_length, chunk_length)] for k, t in concatenated_examples.items() } # add remainder to global variable for next batch remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()} # prepare labels result[ "labels" ] = result[ "input_ids" ].copy() return result # tokenize and chunk dataset lm_dataset = dataset .map ( partial(chunk, chunk_length=chunk_length), batched = True , ) print ( f"Total number of samples: { len (lm_dataset)} " ) return lm_dataset Once the training data is complete, save it to S3 for use in SageMaker Training Job: training_input_path = f's3:// {sess.default_bucket()} /processed/mistral/dolly-ko/train' lm_dataset.save_to_disk(training_input_path) print ( "uploaded data to:" ) print ( f"training dataset to: {training_input_path} " ) Training Process With the dataset prepared, let's proceed with training using SageMaker's training job. The script used for training was the QLoRA script. QLoRA is an efficient fine-tuning approach that reduces memory usage enough to fine-tune a 65B parameter model on a single 48GB GPU while maintaining full 16-bit fine-tuning task performance. QLoRA backpropagates gradients to LoRA (Low Rank Adapter) through fixed 4-bit quantized pre-trained language models. The script can be found at: https://github.com/artidoro/qlora/blob/main/qlora.py Back to SageMaker, let's define parameters for training: from huggingface_hub import HfFolder # hyperparameters, which are passed into the training job hyperparameters = { 'model_id' : model_id, 'dataset_path' : '/opt/ml/input/data/training' , 'num_train_epochs' : 3 , 'per_device_train_batch_size' : 6 , 'gradient_accumulation_steps' : 2 , 'gradient_checkpointing' : True , 'bf16' : True , 'tf32' : True , 'learning_rate' : 2e-4 , 'max_grad_norm' : 0.3 , 'warmup_ratio': 0.03, "lr_scheduler_type":"constant" , 'save_strategy' : "epoch" , "logging_steps" : 10 , 'merge_adapters' : True , 'use_flash_attn' : True , 'output_dir' : '/opt/ml/checkpoints' , if HfFolder.get_token() is not None : hyperparameters[ 'hf_token' ] = HfFolder.get_token() We need to define an Estimator. Since we'll use SageMaker HuggingFace DLC, we'll use HuggingFace Estimator. It's integrated into the SageMaker SDK for easy use, so we can define it easily: from sagemaker.huggingface import HuggingFace # define Training Job Name job_name = f'huggingface-qlora- {hyperparameters[ "model_id" ].replace( "/","-" ).replace( ".","-" )}' chekpoint_s3 = f's3:// {sess.default_bucket()} /checkpoints' # create the Estimator huggingface_estimator = HuggingFace( entry_point = 'run_qlora.py' , source_dir = '../scripts' , instance_type = 'ml.g5.4xlarge' , instance_count = 1 , checkpoint_s3_uri = chekpoint_s3, max_run = 2*24*60*60 , base_job_name = job_name, role = role, volume_size = 300 , transformers_version = '4.28' , pytorch_version = '2.0' , py_version = 'py310' , hyperparameters = hyperparameters, environment = { "HUGGINGFACE_HUB_CACHE" : "/tmp/.cache" }, disable_output_compression = True ) Now that all training preparations are complete, let's define the data and proceed with fitting: data = { 'training' : training_input_path} huggingface_estimator.fit(data, wait = True ) Deep learning containers for training are provisioned and training proceeds. When training is in progress, you can check the progress in SageMaker Training and monitor it through CloudWatch. 
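The estimator above delegates the actual QLoRA work to run_qlora.py. For readers who want to see what the technique boils down to, here is a minimal, hedged sketch of the usual 4-bit quantization plus LoRA adapter setup with transformers/peft/bitsandbytes. This is not the exact script used in the post; the target modules and LoRA values are illustrative.

```python
# Hedged sketch of the core QLoRA setup (4-bit base model + LoRA adapters).
# This mirrors what scripts like qlora.py / run_qlora.py typically do; it is
# not the exact training script from the post, and values are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"

# Load the frozen base model in 4-bit (NF4) precision so it fits on one GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; only these weights receive gradients.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```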
Training took a total of 29,172 seconds (approximately 486.2 minutes, or 8.1 hours), and since we used ml.g5.4xlarge it cost $13.16. SageMaker Training works much like AWS Batch: you pay only for the time the job is actually running, which removes the overhead of provisioning and maintaining instances and makes it considerably cheaper than training on always-on GPU instances. Model Deployment Once model training is complete, we need to deploy it. First, let's check that the model was created correctly. The model can be found under output/model, with the Training Job name as the prefix, in the default S3 path. Once we confirm the model is saved properly, let's proceed with deployment. First, we need an image for deployment. Again, we'll load the inference environment image provided by Hugging Face: from sagemaker.huggingface import get_huggingface_llm_image_uri # retrieve the llm image uri llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0", session=sess) # print ecr image uri print(f"llm image uri: {llm_image}") Using this image, let's define the LLM model. Define the environment variables used for deployment and set the S3 URI of the previously confirmed model as the model data: import json from sagemaker.huggingface import HuggingFaceModel instance_type = "ml.g5.2xlarge" number_of_gpu = 1 health_check_timeout = 300 config = { 'HF_MODEL_ID': "/opt/ml/model", 'SM_NUM_GPUS': json.dumps(number_of_gpu), 'MAX_INPUT_LENGTH': json.dumps(1024), 'MAX_TOTAL_TOKENS': json.dumps(2048), } llm_model = HuggingFaceModel( role=role, image_uri=llm_image, model_data={'S3DataSource': {'S3Uri': model_s3_path, 'S3DataType': 'S3Prefix', 'CompressionType': 'None'}}, env=config ) Now that we've defined the model, we just need to deploy it: llm = llm_model.deploy( initial_instance_count=1, instance_type=instance_type, container_startup_health_check_timeout=health_check_timeout, ) The deployment process took about 10 minutes. Korean Inference Testing Now, to check whether Korean was properly learned, let's compare the Mistral JumpStart model with the model we trained by sending both the same prompts. The inference code was written as follows: import json import boto3 newline, bold, unbold = "\n", "\033[1m", "\033[0m" endpoint_name = "your-endpoint-name" def query_endpoint(payload): client = boto3.client("runtime.sagemaker") response = client.invoke_endpoint( EndpointName=endpoint_name, InferenceComponentName='your-inference-component-name (omit if none)', ContentType="application/json", Body=json.dumps(payload).encode("utf-8") ) model_predictions = json.loads(response["Body"].read()) generated_text = model_predictions[0]["generated_text"] print( f"Input Text: {payload['inputs']}{newline}" f"Generated Text: {bold}{generated_text}{unbold}{newline}" ) General Jumpstart Base Model / Model with Korean Dataset Applied When asking the same question "대한민국 수도 서울에 대해 알려줘 (Tell me about Seoul, the capital of South Korea)", you can see that the base model responds in English. Since the base model doesn't support Korean, an English response is the best it can do, and sometimes it doesn't recognize Korean at all and just produces nonsense. The model trained on the Korean dataset, however, gives correct answers in Korean. Today, we fine-tuned the Mistral model, which doesn't support Korean, on a Korean dataset to enable Korean usage. Since the training dataset was only 15k samples, which isn't a large amount of data, we could only confirm that Korean is now possible to some extent. 
This could be improved by training on a larger dataset. Since training on more data would incur considerable cost, we plan to keep costs down by using Spot Instances for training; a rough sketch of that option is included below. We'll share updates when they're available. This concludes our post. Thank you.
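SageMaker supports this through managed spot training. A minimal sketch of what that could look like on the same HuggingFace estimator, assuming the role, hyperparameters, checkpoint bucket, and training path defined earlier in the post (values are illustrative, not tested):

```python
# Hedged sketch: enabling SageMaker managed spot training for the QLoRA job.
# Reuses role, hyperparameters, chekpoint_s3, and training_input_path from above.
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="run_qlora.py",
    source_dir="../scripts",
    instance_type="ml.g5.4xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters=hyperparameters,
    # Spot-specific settings:
    use_spot_instances=True,            # request Spot capacity for the training job
    max_run=2 * 24 * 60 * 60,           # hard limit on training time (seconds)
    max_wait=3 * 24 * 60 * 60,          # total wait incl. Spot interruptions (>= max_run)
    checkpoint_s3_uri=chekpoint_s3,     # checkpoints let an interrupted job resume
)

huggingface_estimator.fit({"training": training_input_path})
```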

  • SageMaker: Building Your Own Image Generation Model (with Fine-Tuning)

    SageMaker : Building Your Own Image Generation Model (with Fine-Tuning) Written by Hyeonmin Kim Generative AI models are broadly divided into two categories: image generation models and text generation models. Each generative model processes various types of input but produces output in two main formats: images and text. Today, these models are commonly used to create chatbots or generate image assets for various applications. Today, based on the architecture above, we will: Deploy an image generation model easily using SageMaker Train custom images through SageMaker Training Job Use the fine-tuned image generation model to create images based on our custom dataset SageMaker introduced the JumpStart feature in December 2020. Since then, the model collection has steadily grown, and in November 2022, it became possible to easily deploy Stable Diffusion models. We will use this JumpStart feature to leverage Stable Diffusion , a text-to-image model, to train our custom image s through the learning process. Environment Setup SageMaker Environment Configuration Adding Training Image Dataset Configuring Training/Output S3 Buckets SageMaker Training Setup Fine-tuning with SageMaker Training Job Deploying the Fine-tuned Model Testing the Fine-tuned Model Environment Setup To follow this post, you'll need a SageMaker Studio environment. (Reference: SmileShark Blog - Getting Started with GenAI Fine-tuning on AWS) First, access SageMaker Studio and create a Jupyter notebook (.ipynb) environment before proceeding. If you see a Jupyter environment like this, you're ready to go. SageMaker Environment Configuration Before starting this work, let's check the session to grant permissions to access related AWS resources. import botocore import sagemaker, boto3, json from sagemaker import get_execution_role import os iam = boto3.client( "iam" ) aws_role = iam.get_role(RoleName= "hmkim-sagemaker-full" )[ "Role" ][ "Arn" ] boto_session = boto3.Session() aws_region = boto_session.region_name sess = sagemaker.Session(boto_session=boto_session) print (aws_role) print (aws_region) print (sess.boto_region_name) If the IAM role and two region names match as shown, you've succeeded. Now let's create a directory for training images and add the images to be trained. The following code creates a folder called training_images , where you need to add the images for training. In this post, we'll train our family dog. local_training_dataset_folder = " training_images " if not os.path.exists(local_training_dataset_folder): os.mkdir(local_training_dataset_folder) Adding Training Image Dataset Check the left menu bar to confirm the training_images folder has been created, then add images. You can use the sample images below, but we'll add images of our family dog. By the way, our dog's name is Echo and the breed is Poodle. Here are the images used for training: You can easily add images by drag and drop. Now that we've added the image dataset to be trained, we need to tell the system what these images represent. We'll add information about "what they represent" in a dataset_info.json   file. Use the following code to create it. I wrote a prompt describing it as a photo of a dog named Echo. instance_prompt = "A photo of a dog named echo" import os import json with open (os.path.join(local_training_dataset_folder, "dataset_info.json" ), "w") as f: f.write(json.dumps({ "instance_prompt" : instance_prompt})) The dataset_info.json file is created with the following content. 
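The original post shows the file as a screenshot. Based on the instance_prompt defined in the cell above, its contents are simply {"instance_prompt": "A photo of a dog named echo"}; a quick check in the notebook:

```python
# Verify the expected contents of training_images/dataset_info.json
# (shown only as a screenshot in the original post).
import json
import os

with open(os.path.join(local_training_dataset_folder, "dataset_info.json")) as f:
    print(json.load(f))  # -> {'instance_prompt': 'A photo of a dog named echo'}
```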
Configuring Training/Output S3 Buckets Now, let's specify the S3 bucket where SageMaker models and training data will be stored. We can use AWS-provided utils.py to automatically create it if it doesn't exist. mySession = boto3.session.Session() AwsRegion = mySession.region_name account_id = boto3.client( "sts" ).get_caller_identity().get( "Account" ) training_bucket = f"stable-diffusion-jumpstart -{AwsRegion}-{account_id}" s3 = boto3.client( "s3" ) s3.download_file( f"jumpstart-cache-prod- {AwsRegion}", "ai_services_assets/custom_labels/cl_jumpstart_ic_notebook_utils.py" , "utils.py" , ) from utils import create_bucket_if_not_exists create_bucket_if_not_exists(training_bucket) Specify the S3 path for training. Then copy (upload) the training image set previously saved as training_images to S3. train_s3_path = f"s3:// {training_bucket} /custom_dog_stable_diffusion_dataset/" !aws s3 cp -- recursive $local_training_dataset_folder $train_s3_path Checking the actual bucket, you can confirm the images have been uploaded. Specify the S3 location where results will be output. output_bucket = sess.default_bucket() output_prefix = "jumpstart-example-sd-training" s3_output_location = f"s3:// {output_bucket} / {output_prefix} /output" SageMaker Training Setup Now let's fine-tune the stable diffusion model with these images. As mentioned earlier, stable diffusion was added as a jumpstart model, making it easy to specify the base model for training. We'll use the model-txt2img-stabilityai-stable-diffusion-v2-1-base model. from sagemaker import image_uris, model_uris, script_uris train_model_id, train_model_version, train_scope = ( "model-txt2img-stabilityai-stable-diffusion-v2-1-base", "*", "training" , ) training_instance_type = "ml.g4dn.2xlarge" train_image_uri = image_uris.retrieve( region = None , framework = None , model_id = train_model_id, model_version = train_model_version, image_scope = train_scope, instance_type = training_instance_type, ) train_source_uri = script_uris.retrieve( model_id = train_model_id, model_version = train_model_version, script_scope = train_scope ) train_model_uri = model_uris.retrieve( model_id = train_model_id, model_version = train_model_version, model_scope = train_scope ) The SageMaker SDK automatically retrieves the environment, source, and models needed for training. This is one of SageMaker's major advantages. Define hyperparameters for tuning. The SageMaker SDK provides default hyperparameters suitable for the model, so you only need to modify the parts you want to change. I'll adjust only the max_step. from sagemaker import hyperparameters hyperparameters = hyperparameters.retrieve_default( model_id=train_model_id, model_version = train_model_version ) hyperparameters[ "max_steps" ] = "200" print (hyperparameters) Fine-tuning with SageMaker Training Job Let's proceed with fine-tuning using SageMaker SDK (Training). Running the following code creates an estimator including the image set and resources defined to be provided to the model, then proceeds with fitting by referencing S3. This process takes about 10 minutes. 
% time from sagemaker.estimator import Estimator from sagemaker.utils import name_from_base from sagemaker.tuner import HyperparameterTuner training_job_name = name_from_base( f"jumpstart-example- {train_model_id} -transfer-learning" ) sd_estimator = Estimator( role = aws_role, image_uri = train_image_uri, source_dir = train_source_uri, model_uri = train_model_uri, entry_point = "transfer_learning.py" , instance_count = 1 , instance_type = training_instance_type, max_run = 360000 , hyperparameters = hyperparameters, output_path = s3_output_location, base_job_name = training_job_name, ) sd_estimator.fit({ "training" : train_s3_path}, logs = True ) Checking training in the SageMaker console, you can confirm the training job is running. When the training job runs, it automatically configures an ML environment as a container, goes through training, and finally outputs a custom-trained model. You can see the trained model is located in the previously specified output bucket. Deploying the Fine-tuned Model Now let's deploy this fine-tuned model to a SageMaker Endpoint. Deploy using the same SageMaker deployment method, but specify the model information as trained model information. % time inference_instance_type = "ml.g4dn.2xlarge" deploy_image_uri = image_uris.retrieve( region = None , framework = None , image_scope= "inference" , model_id = train_model_id, model_version = train_model_version, instance_type = inference_instance_type, ) deploy_source_uri = script_uris.retrieve( model_id=train_model_id, model_version=train_model_version, script_scope="inference" ) endpoint_name = name_from_base( f"jumpstart-example-FT- {train_model_id} -" ) finetuned_predictor = sd_estimator.deploy( initial_instance_count = 1 , instance_type = inference_instance_type, entry_point = "inference.py" , image_uri = deploy_image_uri, source_dir = deploy_source_uri, endpoint_name = endpoint_name, ) Testing the Fine-tuned Model Let's declare functions for the inference process to test the model. import matplotlib.pyplot as plt import numpy as np def query (model_predictor, text): """Query the model predictor.""" encoded_text = json.dumps(text).encode( "utf-8" ) query_response = model_predictor.predict( encoded_text, { "ContentType": "application/x-text" , "Accept": "application/json" , }, ) return query_response def parse_response (query_response): """Parse response and return generated image and the prompt""" response_dict = json.loads(query_response) return response_dict[ "generated_image" ], response_dict[ "prompt" ] def display_img_and_prompt (img, prmpt): """Display hallucinated image.""" plt.figure(figsize = ( 12 , 12 )) plt.imshow(np.array(img)) plt.axis( "off" ) plt.title(prmpt) plt.show() Now let's create a prompt and output an image. I asked for an image of Echo with a happy smile. (A photo of a dog named echo with happy smile) all_prompts = [ "A photo of a dog named echo with happy smile" , ] for prompt in all_prompts: query_response = query(finetuned_predictor, prompt) img, _ = parse_response(query_response) display_img_and_prompt(img, prompt) AI Echo is complete! But something seems strange. The clothes are only on the front legs. Let me generate another one. all_prompts = [ "A photo of a dog named echo that eating snacks" , ] for prompt in all_prompts: query_response = query(finetuned_predictor, prompt) img, _ = parse_response(query_response) display_img_and_prompt(img, prompt) It generates normally, but we need to add negative words to remove the clothes or other unnecessary content for practical use. 
The clothes are probably appearing because we used images of Echo wearing clothes during training. Let's define and use a query function for more precise prompt adjustment. from PIL import Image from io import BytesIO import base64 import json def query_endpoint_with_json_payload (model_predictor, payload, content_type, accept): """Query the model predictor with json payload.""" encoded_payload = json.dumps(payload).encode( "utf-8" ) query_response = model_predictor.predict( encoded_payload, { "ContentType" : content_type, "Accept" : accept, }, ) return query_response def parse_response_multiple_images (query_response): """Parse response and return generated image and the prompt""" response_dict = json.loads(query_response) return response_dict[ "generated_images" ], response_dict[ "prompt" ] def display_encoded_images (generated_images, title): """Decode the images and convert to RGB format and display Args: generated_images: are a list of jpeg images as bytes with b64 encoding. """ for generated_image in generated_images: generated_image_decoded = BytesIO(base64.b64decode(generated_image.encode())) generated_image_rgb = Image. open (generated_image_decoded).convert( "RGB" ) display_img_and_prompt(generated_image_rgb, title) def compressed_output_query_and_display (payload, title): query_response = query_endpoint_with_json_payload( finetuned_predictor, payload, "application/json" , "application/json;jpeg" ) generated_images, prompt = parse_response_multiple_images(query_response) display_encoded_images(generated_images, title) Now, let's specify clothes (cloth) as a negative prompt when generating images. prompt = "A photo of a dog named echo with happy smile." negative_prompt = "cloth" payload = { "prompt" : prompt, "negative_prompt" : negative_prompt, "seed" : 1 } compressed_output_query_and_display( payload, f"generated image with negative prompt: ` {negative_prompt} `" ) Now it looks a bit more like our Echo! This time, I requested an image of Echo running on the beach. prompt = "A photo of a dog named echo / the dog is running on the beach side" negative_prompt = "cloth" payload = { "prompt": prompt, "negative_prompt": negative_prompt} compressed_output_query_and_display( payload, f"generated image with negative prompt: ` {negative_prompt} `" ) It generates well as expected. Looking at both outputs, you can confirm that each output is similar to the previously trained images. The more training images we have, the more accurate the images will probably be. Today, we used SageMaker to fine-tune Stable Diffusion, one of the Jumpstart FM models, with custom images and SageMaker Training Job to create and deploy a model that can generate our own images. The range of applications seems limitless, from generating company image assets to creating virtual characters. This concludes our post.
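One housekeeping step the post doesn't show: the fine-tuned endpoint keeps billing while it runs, so it's worth tearing it down after testing. A hedged sketch using the predictor object created during deployment above:

```python
# Hedged cleanup sketch: remove the test endpoint and its model once you're done
# generating images, since a running ml.g4dn.2xlarge endpoint bills by the hour.
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()
```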

  • AWS Lambda: The Ultimate Guide for Beginners 1/2

    Everything About AWS Lambda: The Ultimate Guide for Beginners 1/2 Written by Hyojung Yoon Today, we will learn about AWS Lambda, a key player in various IT environments. AWS Lambda enables the provision of services with high availability and scalability, thus enhancing performance and stability in cloud environments like AWS. In this blog, we'll delve into AWS Lambda, covering its basic concepts, advantages and disadvantages, and real-life use cases. Additionally, we'll compare AWS Lambda with EC2 to understand when to use each service. Let's get started! What is AWS Lambda? What is Serverless Computing? AWS Lambda How AWS Lambda Works Pros and Cons of AWS Lambda Advantages Serverless Architecture Cost-Effective Integration with AWS Services Disadvantages Execution Time Limit Stateless ColdStart Concurrency Limit Use Cases of AWS Lambda Automation of System Operations Web Applications Serverless Batch Processing Others Differences between AWS Lambda and EC2 When to Use AWS Lambda? When to Use AWS EC2? Conclusion What is AWS Lambda? 1. What is Serverless ¹ Computing? AWS Lambda is a serverless computing service. Serverless computing is a cloud computing execution model that allows the operation of backend services without managing servers. Here, you can focus solely on writing code, while AWS manages the infrastructure. This model enables developers to develop and deploy applications more quickly and efficiently. ¹Serverless? A cloud-native development model where developers don't need to provision servers or manage application scaling. Essentially, cloud providers manage server infrastructure, freeing developers to focus more on the actual functionalities they need to implement. 2. AWS Lambda AWS Lambda is an event-driven serverless computing service that enables code execution for a variety of applications and backend services without the need to provision or manage servers. Users simply provide code in a supported language runtimes ( Lambda supports Python, C#, Node.js, Ruby, Java, PowerShell, Go ). The code is structured as Lambda functions, which users can write and use as needed. AWS Lambda offers an automatically triggered code execution environment, ideal for an event-based architecture and powerful backend solutions. For example, code is executed when a file is uploaded to an S3 bucket or when a new record is added to DynamoDB. 3. How AWS Lambda Works Lambda Functions These are resources in Lambda that execute code in response to events or triggers from other AWS services. Functions include code to process events or other AWS service events that are passed to them. Event Triggers (Event Sources) AWS Lambda runs function instances to process events. These can be directly called using the Lambda API or triggered by various AWS sercies and resources. AWS Lambda functions are triggered by various events, like HTTP requests, data state transitions, file uploads, etc. How Lambda Works You create a function, add basic information, write code in the Lambda editor or upload it, and AWS handles scaling, patching, and infrastructure management. Pros and Cons of AWS Lambda Using AWS Lambda allows developers to focus on development without the burden of server management, similar to renting a car where you only drive, and maintenance is handled by the rental company. However, Lambda functions are stateless, so additional configurations are necessary for state management. Also, the 'cold start' phenomenon can slow initial response times, like a computer waking from sleep. 1. 
Advantages 1) Serverless Architecture Developers can focus on development without worrying about server management, akin to renting and driving a car while maintenance is handled by the rental company. 2) Cost-Effective Pay only for the computing resources actually used. Functions are called and processed only when needed, so you don't need to keep servers running all the time, making it cost-effective. Lambda charges based on the number of requests and the execution time of the Lambda code, so no charges apply when code is not running. 3) Integration with AWS Services Allows seamless integration and programmatic interactions with other AWS services. Lambda functions also allow programmatic interactions with other AWS services using one of the AWS software development kits (SDKs). 2. Disadvantages 1)Execution Time Limit Lambda has a maximum execution time of 15 minutes (900 seconds) and a maximum memory limit of 10GB (10240MB). Thus, it is not suitable for long-running processes that exceed 15 minutes. 2) Stateless ³ Not suitable for maintaining states or DB connections. - ³Stateless ? Means that data is not stored between interactions, allowing for multiple tasks to be performed at once or rapidly scaled without waiting for a task to complete. 3) ColdStart As a serverless service for efficient resource use, Lambda turns off computing power if not used for a long time. When a function is first called, additional setup is needed to run the Lambda function, leading to a delay known as a Cold Start. The cold start phenomenon varies depending on the language used and memory settings. This initial delay can affect performance by delaying responses. ※ Solutions for Cold Start 1. Use Lambda SnapStart With SnapStart, Lambda initializes the function when you publish a function version. Lambda creates a snapshot of the initialized execution environment's memory and disk state in a Firecracker microVM, encrypts it, and caches the snapshot for quick access with minimal delay. For more details, refer to AWS documentation. AWS 문서 2. Increase the Allocated Memory to Improve Specs The range of cold start delay varies with the function size, allocated memory, and code complexity. Adding memory proportionally increases CPU capacity, enhancing overall computing performance. 3. Activate Provisioned Concurrency An option to have execution environments ready in advance for immediate response to function calls. Activating this reduces delay, but incurs additional costs. 4) Concurrency ⁴ Limit By default, Lambda limits the number of Lambda functions that can be executed simultaneously to 1000 per region. Exceeding this number of requests can prevent Lambda from performing. - ⁴Concurrency ? The number of requests a Lambda function is processing at the same time. As concurrency increases, Lambda provisions more execution environment instances to meet the demand. Use Cases of AWS Lambda Lambda is ideal for applications that need to rapidly scale up and scale down to zero when there's no demand. For example, Lambda can be used for purposes like: 1. Automation of System Operations 🎬 Set up CloudWatch Alarms for all resources. When resources are in poor condition, such as Memory Full or a sudden CPU spike, CloudWatch Alarms trigger a Lambda Function. The Lambda Function notifies the team or relevant parties via Email or Slack Notification. Combine Lambda Function with Ansible for automated recovery in case of failure, such as resetting memory on a local instance or replacing resources when Memory Full occurs. 2. 
Web Applications 🎬 Store Static Contents (like images) in S3 when clients connect. Use CloudFront in front of S3 for fast serving globally. Separately use Cognito for authentication. For Dynamic Contents and programmatic tasks, use Lambda and API Gateway to provide services, with DynamoDB as the backend database. 3. Serverless Batch Processing 🎬 When an object enters S3, a Lambda Splitter distributes tasks to Mappers, and the Mappers save the completed tasks in DynamoDB. Lambda Reducer outputs back to S3. 4. Other Cases 1) Real-Time Lambda Data Processing Triggered by Amazon S3 Uploads. [Example] Thumbnail creation for S3 source images. 2) Stream Processing Use Lambda and Amazon Kinesis for real-time streaming data processing for application activity tracking, transaction order processing, clickstream analysis, data cleansing, log filtering, indexing, social media analysis, IoT device data telemetry, etc. 3) IoT Backend Build a serverless backend using Lambda to handle web, mobile, IoT, and third-party API requests. 4) Mobile Backend Build a backend using Lambda and Amazon API Gateway to authenticate and process API requests. Integrate easily with iOS, Android, web, and React Native frontends using AWS Amplify. Differences Between AWS Lambda & EC2 AWS Lambda is serverless and event-driven, suitable for low-complexity, fast execution tasks, and infrequent traffic. EC2, on the other hand, is ideal for high-performance computing, disaster recovery, DevOps, development, and testing, and offers a secure environment. AWS Lambda AWS EC2 Uses the serverless model, eliminating the need to manage servers. User management of OS, application updates, security, and network configurations is necessary. In the serverless environment, it automatically manages computing resources depending on the amount of traffic, and which can be very cost-effective. In contrast to serverless, it requires the user to manage various aspects such as OS, application updates, security, and network configurations. However, it offers the flexibility to optimize resources. Focuses on code execution and is highly compatible with AWS Lambda's serverless model. Users must select an operating system, configure server properties, and then deploy EC2 instances. It supports a variety of use cases, tailored to different operational scenarios. It provides a high level of user control. 1. When Should I Use AWS Lambda? Low-Complexity Code: Lambda is the perfect choice for running code with minimal variables and third-party dependencies. It simplifies the handling of easy tasks with low-complexity code. Fast Execution Time : Lambda is ideal for tasks that occur infrequently and need to be executed within minutes. Infrequent Traffic : Businesses dislike having idle servers while still paying for them. A pay-per-use model can significantly reduce computing costs. Real-Time Processing : Lambda, when used with AWS Kinesis, is best suited for real-time batch processing. Scheduled CRON Jobs : AWS Lambda functions are well-suited for ensuring scheduled events are triggered at their set times. 2. When Should I Use AWS EC2? High-Performance Computing : Using multiple EC2 instances, businesses can create virtual servers tailored to their needs, making EC2 perfect for handling complex tasks. Disaster Recovery : EC2 is used as a medium for disaster recovery in both active and passive environments. It can be quickly activated in emergencies, minimizing downtime. 
DevOps : DevOps processes have been comprehensively developed around EC2 Development and Testing : EC2 provides on-demand computing resources, enabling companies to deploy large-scale testing environments without upfront hardware investments. Secure Environment : EC2 is renowned for its excellent security. Conclusion This guide provided an in-depth understanding of AWS Lambda, which plays a significant role in traffic management and server load balancing in the AWS environment. In the next session, we will explore accessing the console, creating and executing Lambda functions, and understanding fee calculations. We hope this guide helps you in starting and utilizing AWS Lambda, as you embark on your journey into the expansive serverless world with AWS Lambda! Links A Deep Dive into AWS Lambda - Sungyeol Cho, System Engineer (AWS Managed Services) - YouTube What is AWS Lambda? - AWS Lambda Troubleshoot Lambda function cold start issues | AWS re:Post
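Returning to the use cases above: they stay at the architecture level, so to make the event-driven model concrete, here is a minimal hedged sketch of a Python Lambda handler for the "real-time processing triggered by Amazon S3 uploads" example. The bucket and key come from the S3 event; the thumbnail logic itself is only indicated by a comment.

```python
# Hedged sketch of an S3-upload-triggered Lambda function (e.g., thumbnail creation).
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Lambda invokes this handler with the S3 event; there is no server to manage."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.head_object(Bucket=bucket, Key=key)
        print(f"New object s3://{bucket}/{key} ({obj['ContentLength']} bytes)")
        # ... resize the image / write a thumbnail to another bucket here ...
    return {"statusCode": 200, "body": json.dumps("processed")}
```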

  • API Gateway Cost Optimization: 10 Efficient Tips to Save Money | AWS Guide

    API Gateway Cost Optimization: 10 Efficient Tips to Save Money Written by Hyojung Yoon Hello everyone! It's been a while since our last cost-related tips blog post. Today, we're diving into Amazon API Gateway, a powerful tool that, if not managed properly, can lead to unexpected costs. In this comprehensive guide, we'll explore 10 efficient usage tips and money-saving strategies for API Gateway. Contents What is API Gateway? Key Features Traffic Management CORS Support Authentication and Authorization Management Response Caching Monitoring and Logging Why Use API Gateway? Easy and Quick API Creation and Deployment Strong Security Features Integration with Serverless Architecture API Usage and Cost Management Real-time Performance Monitoring Understanding API Gateway Cost Structure API Gateway Costs Calculating Real Costs 10 Tips for Cost Savings Conclusion What is API Gateway? API Gateway is a service that manages communication between clients and backend services, acting as a server intermediary . It provides API creation, management, monitoring, and protection regardless of scale. This allows clients to communicate safely and efficiently with servers, essentially serving as a gateway for APIs. Key Features ① Traffic Management Amazon API Gateway can handle multiple client requests simultaneously, minimizing server load and ensuring stable service operation even during high traffic periods . ② CORS Support Cross-Origin Resource Sharing (CORS) enables web applications to access resources across different domains. While browsers typically follow the Same-Origin Policy, API Gateway enables clients to call APIs from other domains. ③ Authentication and Authorization Management API Gateway integrates with AWS IAM and Amazon Cognito to provide user-specific permission management and authentication , protecting data and blocking unauthorized access. ④ Response Caching API Gateway's response caching feature can pre-store frequently requested data, reducing server load and improving response times. By using cached data, it optimizes performance and reduces costs by minimizing unnecessary requests to backend servers. ⑤ Monitoring and Logging API Gateway integrates with AWS CloudWatch to monitor API performance in real-time and log request and response statuses. This allows for quick detection and response to performance issues or errors in the API. Why Use API Gateway? 1. Easy and Quick API Creation and Deployment REST, HTTP, and WebSocket APIs can be easily created and deployed without complex infrastructure setup, improving development productivity. 2. Strong Security Features Integration with AWS IAM allows for detailed authentication and authorization management , preventing unauthorized access to APIs. 3. Integration with Serverless Architecture API Gateway integrates seamlessly with serverless services like AWS Lambda , making it easy to build scalable serverless applications without worrying about server management. 4. API Usage and Cost Management Through API keys , you can manage usage per client and set usage plans, preventing API abuse and unnecessary costs. 5. Real-time Performance Monitoring API Gateway integrates with AWS CloudWatch, allowing real-time monitoring of API performance and usage. You can track metrics such as request count, latency, and error rates to quickly resolve performance issues and maintain service stability. 
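Point 4 above (API usage and cost management) is driven by API keys and usage plans. A hedged boto3 sketch of wiring one up; the API ID, stage name, and limits are placeholder values, not taken from the post:

```python
# Hedged sketch: create a usage plan with throttling and a monthly quota,
# then attach an API key to it. IDs and limits below are placeholders.
import boto3

apigw = boto3.client("apigateway")

plan = apigw.create_usage_plan(
    name="basic-clients",
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],  # placeholder REST API / stage
    throttle={"rateLimit": 50.0, "burstLimit": 100},       # requests per second / burst
    quota={"limit": 100000, "period": "MONTH"},            # hard monthly cap per key
)

key = apigw.create_api_key(name="client-a", enabled=True)
apigw.create_usage_plan_key(
    usagePlanId=plan["id"], keyId=key["id"], keyType="API_KEY"
)
```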
Understanding API Gateway Cost Structure API Gateway costs are determined by the number of API calls, the volume of data transferred, and the use of additional features . Let's look at these three items in the table below. By API type, HTTP API is the cheapest, REST API has moderate costs due to additional features, and WebSocket API is relatively more expensive as it supports real-time communication. Data transfer fees are mostly free for transfers within the cloud, but outbound data transfers (to external networks) incur additional charges. Also, using additional features like caching and data transformation incurs extra charges based on usage. Calculating Real Costs Let's calculate the cost assuming 25 HTTP API calls per minute for a month (30 days), with each call uploading 4.5MB of data. Total API Call Calculation: Monthly total API calls = 25 calls/minute x 60 minutes/hour x 24 hours/day x 30 days/month = 1,080,000 calls/month Request Number Calculation: Each call's data is 4.5MB, calculated in 512KB units. Requests = 4.5MB x 1024KB / 512KB = 9 requests , so each call is processed as 9 requests. Monthly total requests = 1,080,000 calls x 9 requests/call = 9,720,000 requests/month Cost Calculation: As the total number of requests is less than 300 million, the first 300 million request tier applies. Total cost = 9,720,000 requests ÷ 1,000,000 requests x $1.23 = $11.95 Therefore, the total number of requests from 1,080,000 API calls per month is 9,720,000 and the corresponding cost is $11.95 . 10 Tips for Cost Savings 1. Develop a Caching Strategy By caching frequently requested data in advance, you can reduce both server load and costs. Set cache expiration times to match data change cycles and maintain an appropriate cache size to increase efficiency. Case Study : Betterfly Optimizes Dynamic Insurance Using API Gateway 2. Handle Request/Response Transformations in API Gateway Instead of Lambda Simple request/response transformations can be handled using API Gateway's mapping templates instead of Lambda. This reduces the number of Lambda invocations, cutting costs. 3. Choose the Optimal Integration Type AWS offers both REST API and HTTP API. HTTP API is cheaper and more economical when only basic features are needed. Use HTTP API for simple data retrieval or update operations. 4. Use Stage Variables and Environments Using stage variables allows you to reuse the same API in development, test, and production environments. This separates environment-specific settings and reduces operational management costs. https://${stageVariables.url}/resource 5. Set Usage Plans Setting usage plans to limit API call volume can prevent unnecessary costs. Control call volume by client to prevent service abuse and enable predictable cost management. 6. Integrate with CloudFront to Reduce Data Transfer Costs Placing CloudFront in front of API Gateway can significantly reduce data transfer costs. Data transfer between API Gateway and CloudFront is free, making it advantageous for global traffic management. 7. Remove Unnecessary Steps Removing unnecessary authentication steps or complex logic can speed up response times and reduce costs. Simplify excessive authentication procedures and reduce unnecessary backend calls to optimize API usage. 8. Optimize Logging Levels Excessive CloudWatch logging can dramatically increase costs. Optimize log levels to record only necessary information, such as error logs or detailed logging only during development. 9. 
Regular Usage Monitoring and Analysis Detect abnormal traffic patterns and analyze usage in real-time through CloudWatch monitoring to find opportunities for cost reduction. 10. Consider Dedicated API Gateway for Large-scale Traffic For large-scale traffic, consider using a dedicated API Gateway instance. While the initial cost of a dedicated API Gateway is high, it can be economical in the long run. Conclusion Amazon API Gateway is a very flexible and powerful tool in the cloud environment. However, without proper optimization, unexpected costs may arise. Why not actively utilize the 10 optimization strategies introduced in this blog? Strategies like caching, logging optimization, and integration with CloudFront provide substantial cost-saving effects. Achieve your goal of successful operations by understanding API usage through continuous monitoring and usage pattern analysis, and proceeding with optimization.
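As a quick sanity check of the "Calculating Real Costs" example above, the arithmetic can be reproduced in a few lines. The $1.23-per-million figure is the rate quoted in the post; actual pricing varies by region and tier.

```python
# Reproduce the HTTP API cost example from the post.
import math

calls_per_minute = 25
mb_per_call = 4.5          # payload size per call
metering_unit_kb = 512     # HTTP API requests are metered in 512 KB increments
price_per_million = 1.23   # USD, rate used in the post (region/tier dependent)

calls_per_month = calls_per_minute * 60 * 24 * 30                      # 1,080,000
requests_per_call = math.ceil(mb_per_call * 1024 / metering_unit_kb)   # 9
requests_per_month = calls_per_month * requests_per_call               # 9,720,000

cost = requests_per_month / 1_000_000 * price_per_million
print(f"{requests_per_month:,} billable requests -> ${cost:.2f} per month")
# ≈ $11.96 (the post rounds this to $11.95)
```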

  • Architecture Analysis Using AWS Bedrock (feat. a Disappointing Ending)

    Architecture Analysis Using AWS Bedrock (feat. a Disappointing Ending) Written by Minhyeok Cha When generative AI was all the buzz, I assumed it would rise and fade like blockchain, so I paid it no attention and didn't study it. But at some point I found myself using ChatGPT and Claude all the time, and articles about AI that automatically generates photos and videos kept appearing. Seeing all this, I realized how far behind I had fallen, so I decided to try the cloud I already use most, AWS, and within it the easy-to-use AWS Bedrock. Defining the basic terminology and underlying technology of generative AI would take too long, so I'll skip that here. Contents Using AWS Bedrock Bedrock Knowledge Base Creating a Knowledge Base Functions Used Conclusion (What Was Disappointing) Using AWS Bedrock AWS Bedrock has become well known enough that most of you probably already know it, but the service is serverless and exposed through a simple API-style interface. Taking the Claude model as an example, it can be called as shown in the script above. Knowing how to call the API without a purpose felt like treading water, though. While digging through our company blog I came across an older post on the Well-Architected Framework and figured that combining the two could produce something worthwhile, so I put them together as follows. Brief architecture description: two buckets, one to hold the input material and one to receive the results; Lambda for the Bedrock API call and the prompt, importing FPDF (to produce a PDF) and applying a Korean font; and Bedrock to return the text requested by Lambda. In Lambda, I attached a layer containing FPDF, the Python library that generates PDFs, together with a font file so the output could be rendered in Korean. For the Bedrock API, refer to the script above; in my case I inserted the architecture and added prompt text asking for improvement suggestions from a Well-Architected perspective. For testing I used a simple 3-tier architecture image. Lambda logs are checked in the CloudWatch log group — that said, I didn't add any log statements and only confirmed that the function ran. Then I checked the output bucket. The result came out fine, but it wasn't what I wanted. I had run it through AWS Bedrock, but the quality wasn't satisfying: I wanted to change the tone and to control which information is shown and which is hidden, and there was no good way to do that. Bedrock Knowledge Base Even the ChatGPT and Claude we casually use run on their own models; effectively we are borrowing models that each company has already trained extensively. My goal, however, is to write professional reports on the Well-Architected Framework, and for that the Knowledge Base feature of AWS Bedrock turned out to be helpful. Put simply, a Knowledge Base supports RAG, which makes it possible to return highly relevant, accurate, customized responses. RAG (Retrieval-Augmented Generation) is a method that augments an LLM's response generation by retrieving relevant information from external knowledge sources. How it works: • Retrieval: find information related to the given query or question in an external database or knowledge base. • Augmentation: add the retrieved information to the original prompt. • Generation: the LLM generates a response based on the augmented prompt. Creating a Knowledge Base When you open Bedrock, you can see Knowledge bases in the lower part of the menu as in the screenshot above; clicking it asks you to provide the required resources. I simply put a CSV file in S3 and built the knowledge base from that. Crawling or combining other services is reportedly possible as well. One more thing: when using the Knowledge Base API, images cannot be read because their embedding payload is too large (a 20,000-character limit). After getting advice from Hyeonmin Kim, SmileShark's GenAI SA, I realized I needed multi-modal RAG: splitting and embedding the image, then mapping those embeddings to the prompt. Since a Knowledge Base stores its embeddings in OpenSearch Serverless by default, I plan to keep both the CSV embeddings and the image embeddings to be queried in a single OpenSearch collection. The final architecture after going through this process is shown above. A full code walkthrough would be longer than expected, so I'll skip it, but I'll briefly explain the Bedrock API usage and the results one by one. First, the Lambda function needs permissions for the attached services: S3 input/output, Bedrock, and OpenSearch. Note that what Bedrock creates is the serverless flavor of OpenSearch, so the permissions must match that. Also, OpenSearch Serverless is created automatically when the Bedrock Knowledge Base is created, and the two are wired together at that point, but Lambda is not — you have to register the Lambda ARN here for the embedding vectors to be inserted. Functions Used image_embedding Model: amazon.titan-embed-image-v1 Role: embeds the image as a vector. This embedding is a numeric representation of the image's characteristics and can later be used to search for similar images or analyze image features. 💡 Check the embedded vector values. When I tested whether the embedding worked, the image that previously couldn't be processed because of the character limit mentioned above now came through fine. analyze_image Model: anthropic.claude-3-sonnet-20240229-v1:0 Role: analyzes the image and generates a detailed description. The model provides a comprehensive description of the architecture image, covering colors, behavior, text, operations, and more. query_knowledge_base Model: anthropic.claude-3-sonnet-20240229-v1:0 Role: extracts relevant information from the knowledge base based on the image description, or, if no relevant information exists, provides a general analysis or interpretation of the image description. Using the image embeddings above, the analysis of the input image, the mapping work based on them, and finally the knowledge base data source pulled in through the query_knowledge_base function, I obtained the output shown below. The data is retrieved well, but the text in the generated PDF is rendered messily; that part needs cleanup, and the knowledge base data itself needs further improvement. 
(In the example prompt, the information in parentheses for each SEC item differs slightly from the knowledge base data source I loaded.) Conclusion (What Was Disappointing) Planning and building this AWS Bedrock-based architecture analysis with the Well-Architected Framework mixed in went well up to a point, but some things were disappointing. First, Well-Architected includes resource-based checklist items, but it also covers a great deal about whether a company's operational processes are well managed — something a single architecture diagram cannot capture. To get more out of the review, the prompt should include the actual resources and company guidelines so that more checks can be covered. Second, because I wasn't very comfortable with the code and spent a lot of time searching around, I learned about the LangChain framework far too late. For anyone who wants a reference, I'm adding a link on LangChain concepts. (I'll go read it myself, too.)
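The post describes Bedrock's serverless, API-style usage with Claude as the example (the original showed the call only as a screenshot). A minimal hedged sketch of what such a call looks like with boto3; the model ID matches the one referenced in the post, and the prompt text is illustrative:

```python
# Hedged sketch of calling Claude 3 Sonnet on Bedrock via the runtime API.
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": "Review this 3-tier architecture from a Well-Architected perspective.",
        }
    ],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```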

  • AWS IAM Practical Guide: 4 Use Cases for Advanced Configuration and Real-world Implementation (2/2)

    AWS IAM Practical Guide: 4 Use Cases for Advanced Configuration and Real-world Implementation (2/2) Written by Hyojung Yoon Hello! Welcome back to the SmileShark guide on AWS Identity and Access Management (IAM). In this edition, we've prepared content that explores how IAM can be utilized in real business situations, along with hands-on practice exercises. We'll look at frequently asked questions and essential know-how that you can apply immediately in the field. Let's get started! Context Practical IAM Setup: Learning Through Use Cases Strengthening Root Account Security Configuring IAM for Development Teams Managing IAM in Multi-Account Environments Managing External Partner Access Results and Effects of IAM Implementation IAM Security Best Practices IAM Trouble Shooting and Monitoring Common IAM Errors and Their Solutions Monitoring IAM Activities with CloudTrail Frequently Asked Questions (FAQ) What's the key difference between IAM users and IAM roles? How often should we rotate access keys? What should I be mindful of when writing IAM policies? What advantages does using AWS Organizations offer? How should I handle a lost MFA device? Conclusion Practical IAM Setup: Learning Through Use Cases Let's examine the practical implementation of AWS Identity and Access Management (IAM) through four diverse business scenarios, demonstrating its critical role in enhancing security and operational efficiency in enterprise environments. 1. Strengthening Root Account Security The root account, often referred to as the "superuser" account, holds unrestricted access to all AWS services and resources. As such, it requires exceptional security measures. Q: How can I manage the root account securely? What configurations should I implement? A: To manage the root account securely, it's crucial to enable multi-factor authentication (MFA), remove access keys, and establish a robust password policy. These measures significantly reduce the risk of unauthorized access. Practice: Root Account Security Configuration 1) Activate MFA(Multi-Factor Authentication) Log in to the AWS Console > Click on your account name in the top right > Select 'Security credentials' Choose 'Assign MFA' > Enter a device name and select your preferred authentication app > Click 'Next' After confirming the app used for your virtual MFA device, click 'Show QR code' Enter the first number generated by the app in 'MFA code 1' > Enter the subsequent number in 'MFA code 2' > Click 'Add MFA' 2) Remove Root Account Access Keys (if present) Navigate to 'My Security Credentials' > Locate the 'Access keys' section If you find any existing access keys, choose 'Delete' 3) Implement a Strong Pass word Policy Go to IAM dashboard > Select 'Account settings' Confirming the password policy as follows: Custom > Minimum 14 characters, including a mix of uppercase and lowercase letters, numbers, and special characters 💡 Pro Tip: Reserve the root account for initial setup and critical tasks only. For day-to-day operations, rely on IAM user accounts with appropriate permissions. 2. Configuring IAM for Development Teams Now, let's look at how to set up IAM to allow your development team secure access to necessary AWS resources. Q: What's an efficient way to manage AWS resource access permissions for my development team members? A: An effective approach is to create IAM groups, attach relevant policies to these groups, and then add developers to the appropriate groups. This method enables streamlined permission management. 
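Before the console walkthrough that follows, here is the same group-based setup expressed with boto3 for teams that prefer to script it. This is a hedged sketch: the group name and attached policies mirror the examples in this section, and the user name is a placeholder.

```python
# Hedged sketch: group-based permissions for a dev team, scripted with boto3.
import boto3

iam = boto3.client("iam")

iam.create_group(GroupName="DevTeamA")

# Attach the AWS managed policies used in the walkthrough below.
for policy_arn in (
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonEC2FullAccess",
):
    iam.attach_group_policy(GroupName="DevTeamA", PolicyArn=policy_arn)

# Add an existing IAM user to the group (user name is a placeholder).
iam.add_user_to_group(GroupName="DevTeamA", UserName="dev-user-1")
```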
Practice: Development Team IAM Setup 1) Create an IAM Group and Attach Policies From the IAM dashboard, select 'User groups' and click 'Create group' Assign a name to the group (e.g., 'DevTeamA') In the 'Add users to group' section, input usernames and select the newly created 'DevTeamA' group For attaching permission policies, choose 'AWS managed policies' and search for and add necessary policies e.g., AmazonS3ReadOnlyAccess, AmazonEC2FullAccess Click 'Next step' and finalize the group creation 2) Generate Access Keys After creating a user, navigate to the 'Security credentials' tab Select 'Create access key' to generate a new access key 💡 Pro Tip: Make it a practice to rotate access keys regularly and promptly deactivate any unused keys to enhance security. 3. Managing IAM in Multi-Account Environments Large organizations often utilize multiple AWS accounts. Let's explore effective ways to manage IAM in such complex environments. Q: We're using multiple AWS accounts. Is there an efficient way to manage them? A: Absolutely. You can implement centralized management using AWS Organizations and utilize cross-account IAM roles to establish secure access between accounts. Practice: Setting Up AWS Organizations and Cross-Account Access 1) AWS Organizations Setup Access the AWS Organizations console Select 'Create organization' and complete the setup process 2) Establish a Cross-Account Role Access the IAM console of the target account Select 'Roles' and click 'Create role' Choose 'AWS account' > 'Another AWS account' and input the account ID you wish to grant access Attach appropriate permission policies (e.g., ReadOnlyAccess) 3) Test Cross-Account Access Use AWS CLI from the source account aws sts assume-role --role-arn arn:aws:iam::TARGET_ACCOUNT_ID:role/ROLE_NAME --role-session-name TEST_SESSION 4. Managing External Partner Access There are times when you need to grant temporary AWS resource access to external partners. IAM roles can be particularly useful in such scenarios. Q: What's the safest method to temporarily grant AWS resource access to an external partner? A: The most secure approach is to create an IAM role, assign only the minimum necessary permissions to that role, and then issue temporary credentials through AWS Security Token Service (STS). Practice: Creating an IAM Role for External Partners 1) Create an IAM Role Sign in to the AWS Management Console and navigate to the IAM service Select 'Roles' and click 'Create role' Choose 'Another AWS account' and enter the partner's AWS account ID in the 'Account ID' field 2) Attach a Permission Policy In the 'Attach policies' section, search for and add the necessary policies For example: AmazonS3ReadOnlyAccess Provide a role name (e.g., ExternalCollaborateRole) Add 'Tags' if needed, then click 'Review' and click to 'Create roles' 3) Generate Temporary Credentials In the IAM dashboard, locate and click on the role you just created Select the 'Trust relationships' tab and click 'Edit trust relationship' In the JSON policy document, verify that the partner's AWS account ID is correctly set Use the 'Copy role URL' or 'Send a link to this role' option to generate a URL for accessing the role 4) Provide Temporary Credentials Securely transmit the generated role URL to your partner Provide clear instructions on how to use the role and any necessary precautions 💡Pro Tip: You can limit the validity period of temporary credentials by configuring the 'Maximum CLU/API session duration' in the IAM role setting. 
Results and Effects of IAM Implementation Implementing the four IAM practices we've discussed can yield significant benefits: Enhanced Security Reduced risk of unauthorized access through root account protection Minimized risk of account compromise through MFA implementation Overall improvement in security posture by applying the principle of least privilege Efficient Management Simplified user management through group-based permission allocation Centralized multi-account management via AWS Organizations Flexible resource access management using cross-account roles Cost Optimization Prevention of resource misuse by eliminating unnecessary permissions Reduced duplicate investments through efficient resource sharing Simplified Compliance and Auditing Meeting regulatory compliance requirements through granular permission control Streamlined auditing and tracking of all IAM activities via comprehensive CloudTrail logging Improved Collaboration Establishment of a secure collaboration environment with external partners Support for smooth collaboration through tailored permissions for each project and team IAM Security Best Practices Adhering to AWS IAM security best practices can significantly enhance your overall AWS environment security: Minimize Root Account Usage : Use the root account only for account creation and initial setup. Perform subsequent operations with IAM users that have appropriate permissions. Enforce a Strong Password Policy : Apply a robust password policy to all IAM users to ensure passwords are not easily guessable. Implement MFA : Set up multi-factor authentication (MFA) for all user accounts to add an extra layer of security. Regularly Review Permissions : Periodically review the permissions of IAM users and roles, and remove any unnecessary permissions. Utilize the Policy Simulator : Use the IAM policy simulator to verify that policies are functioning as intended. Monitor via CloudTrail : Leverage CloudTrail to monitor IAM activities and track who performed what actions. IAM Troubleshooting and Monitoring Common IAM Errors and Their Solutions 1. Access Denied Error Cause: Occurs when a user or role lacks the necessary permissions. Solution: Review the policy of the user or role and add the required permissions. 2. InvalidClientTokenId Error Cause: Happens when an access key is incorrect or has expired. Solution: Verify the access key in the IAM console and generate a new key if necessary. 3. MalformedPolicyDocument Error Cause: Arises when the JSON format of an IAM policy document is incorrect. Solution: Use a JSON validation tool to check and correct the policy document format. Monitoring IAM Activities with CloudTrail Navigate to the CloudTrail service in the AWS Management Console. Click 'Create trail' to set up a new trail. Provide a trail name and decide whether to apply it to all regions. Choose an existing S3 bucket or create a new one to specify where logs will be stored. Optionally, you can configure sending logs to CloudWatch Logs. CloudTrail allows you to monitor IAM activities in detail and respond swiftly to any security incidents.
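As a complement to the console steps above, here is a hedged boto3 sketch that pulls recent IAM-related events out of CloudTrail; the event-source filter and the 24-hour window are illustrative assumptions, not part of the original walkthrough.

import boto3
from datetime import datetime, timedelta, timezone

cloudtrail = boto3.client("cloudtrail")

# Look up IAM management events from the last 24 hours
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "iam.amazonaws.com"}
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    MaxResults=50,
)

for event in events["Events"]:
    # Each record answers "who did what, and when"
    print(event["EventTime"], event.get("Username", "-"), event["EventName"])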
Frequently Asked Questions (FAQ) Q: What's the key difference between IAM users and IAM roles? A: IAM users are permanent identities that can continuously access AWS resources. They're typically assigned to specific individuals or applications. On the other hand, IAM roles are sets of permissions that can be assumed temporarily when needed. Roles are particularly useful when users need temporary access to AWS resources for specific tasks. Q: How often should we rotate access keys? A: AWS recommends rotating access keys every 90 days to enhance security. However, depending on your organization's security policy, you might choose to rotate them more frequently. Regular key rotation helps mitigate long-term security risks. Q: What should I be mindful of when writing IAM policies? A: The most critical principle when crafting IAM policies is to adhere to the 'principle of least privilege'. This means granting only the minimum permissions necessary to perform required tasks. It's also advisable to minimize the use of wildcards (*) and instead specify concrete resource ARNs whenever possible. Q: What advantages does using AWS Organizations offer? A: AWS Organizations allows for centralized management of multiple AWS accounts. This enables automated account creation, group-based account management, and policy-based integrated management. These features allow for more systematic and efficient operation of large-scale AWS environments. Q: How should I handle a lost MFA device? A: If you lose the MFA device for your root account, contact AWS Customer Support and explain the situation. The support team can deactivate MFA after going through an identity verification process. You can then register a new MFA device. For IAM users, you can sign in as the root user or as an IAM user with administrator permissions to remove the MFA device. After that, you can register a new MFA device. Conclusion Throughout this AWS IAM practical guide, we've explored how IAM can be effectively utilized in real-world business scenarios. From enhancing root account security to practicing various scenarios, we've covered practical ways to leverage IAM. IAM stands as one of the most critical security services in the AWS ecosystem. When properly configured, IAM enables secure protection of AWS resources and efficient access control. By applying the best practices outlined in this guide, you can significantly enhance the overall security posture of your AWS environment. Remember, security isn't a one-time task but an ongoing process. Regularly review your IAM settings, stay updated with the latest security recommendations, and strive to maintain a secure and efficient AWS environment.

  • AWS EKS log management... you're all doing it, right?

    AWS EKS log management... you're all doing it, right? - Getting started with effective log management on AWS EKS using Fluent Bit Written by Minhyeok Cha While organizing some old study notes, I found a write-up about EKS. I wrote it when I was a complete junior, so I honestly couldn't tell whether I had understood what I was writing. So I brought it back, partly to review and partly to use as blog material. Let me start with a simple architecture. The nginx image is only there to generate test logs; to be honest, the architecture looked too bare without it, so you could say I squeezed it in. Now let's walk through how a new friend called Fluent Bit picks up the logs and ships them to OpenSearch. Contents What is Fluent Bit? Differences between Fluentd and Fluent Bit Starting the demo Creating the EKS cluster and OpenSearch domain Creating the OIDC provider Granting OpenSearch access Installing Fluent Bit Deploying a test image and checking Fluent Bit logs Accessing the OpenSearch domain and checking logs Wrap-up What is Fluent Bit? Before we really get going, let's look at what Fluent Bit actually does. Fluentd is the F in the well-known EFK stack (Elasticsearch, Fluentd, Kibana), and as a member of that stack it is an open-source log collector trusted in many environments. Fluent Bit, which we'll use today, was built to be as lightweight as possible. Differences between Fluentd and Fluent Bit:
Resource usage - Fluentd: uses more resources (typically tens of MB of memory) / Fluent Bit: very lightweight, with low memory usage (usually a few MB)
Functionality - Fluentd: offers plugins and extensions that support complex log-processing scenarios / Fluent Bit: focuses on basic log collection and forwarding
Configuration complexity - Fluentd: allows more complex configurations, and advanced features may require more settings / Fluent Bit: can be started with a relatively simple configuration
💡 In short, Fluent Bit is lighter and easier to use. Starting the demo 1. Create the EKS cluster and the OpenSearch domain 💡 Since this part (creating the EKS cluster and the domain) is not the main topic, I'll skip over it. 2. Create the OIDC provider

# How to create it with eksctl
eksctl utils associate-iam-oidc-provider \
  --cluster eks \
  --approve

Create the policy to attach to the IAM role, fluent-bit-policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["es:ESHttp*"],
            "Resource": "arn:aws:es:${AWS_REGION}:${ACCOUNT_ID}:domain/${ES_DOMAIN_NAME}",
            "Effect": "Allow"
        }
    ]
}

Using the OIDC provider and the policy created above, create the iamserviceaccount:

kubectl create namespace log
eksctl create iamserviceaccount \
  --name fluent-bit \
  --namespace log \
  --cluster eks-logging \
  --attach-policy-arn "arn:aws:iam::${ACCOUNT_ID}:policy/fluent-bit-policy" \
  --approve \
  --override-existing-serviceaccounts

💡 Go to the console and confirm that the created role, the attached policy, the OpenSearch domain, and so on are all in place. 3. Grant OpenSearch access Put the pre-created (or existing) OpenSearch ID and password into variables and run the following command:

curl -sS -u "${ES_DOMAIN_USER}:${ES_DOMAIN_PASSWORD}" \
    -X PATCH \
    "https://{opensearchendpoint}/_opendistro/_security/api/rolesmapping/all_access?pretty" \
    -H 'Content-Type: application/json' \
    -d '[ { "op": "add", "path": "/backend_roles", "value": ["'iamsa:role:arn'"] } ]'

4. Install Fluent Bit On Kubernetes, installing Fluent Bit comes down to deploying its configuration and a DaemonSet. First, write a file that creates a ClusterRole and binds it to the service account.

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: log

Next is the ConfigMap.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: log
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf
  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/nginx/access.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10
  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
  output-elasticsearch.conf: |
    [OUTPUT]
        Name               es
        Match              *
        Host               "OpenSearch domain endpoint"
        Port               443
        TLS                On
        AWS_Auth           On
        AWS_Region         ap-northeast-2
        Index              "index name to use in OpenSearch"
        Replace_Dots       On
        Suppress_Type_Name On

The config section is where I struggled: you have to turn on "Suppress_Type_Name" in the output. When I first installed it and fired some logs, all I got was an error saying Fluent Bit couldn't send to OpenSearch, so I assumed it was a role-binding issue. After some digging, it turned out the Fluent Bit settings weren't compatible with recent OpenSearch versions. With the ConfigMap configured as above, log delivery finally succeeded. The parsers.conf file is also well documented at the official link; since we'll run the test with nginx, we used the following.

  parsers.conf: |
    [PARSER]
        Name        nginx
        Format      regex
        Regex       ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z
    [PARSER]
        Name        json
        Format      json
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

Next is the DaemonSet.

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: log
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
        - name: fluent-bit
          image: amazon/aws-for-fluent-bit:2.5.0
          imagePullPolicy: Always
          ports:
            - containerPort: 2020
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
      serviceAccountName: fluent-bit

The remaining pieces are files in much the same vein as the Logstash part of the data pipeline build I wrote about last time. For precise explanations, I recommend reading the official documentation. 5. Deploy a test image and check the Fluent Bit logs

# Create a demo nginx pod
kubectl run nginx --image=nginx -n default

# Loop that sends 10 HTTP GET requests to nginx
for i in {1..10}; do
  kubectl exec -it nginx -n default -- curl -s http://{look up and fill in the pod IP} | grep -q . && echo "Request $i: Success" || echo "Request $i: Failure"
done

# Check the nginx pod logs
kubectl logs nginx -n default

# Confirm that Fluent Bit is collecting and forwarding the logs
kubectl logs (fluent bit pod name) -n log

Access the OpenSearch domain and check the logs In OpenSearch you need to create an index pattern to browse the logs conveniently in the dashboard. 💡 You can confirm that the logs fired at nginx have been delivered.
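If you'd rather verify the result from a script than from the dashboard, a rough Python sketch like the one below can query the OpenSearch endpoint directly; the endpoint, credentials, and index name are placeholders and assume the same fine-grained access control user used in the curl command earlier.

import requests

# Placeholders: reuse the domain endpoint, master user, and index from the demo
ENDPOINT = "https://your-opensearch-domain-endpoint"
AUTH = ("ES_DOMAIN_USER", "ES_DOMAIN_PASSWORD")
INDEX = "your-index-name"

# Ask for the five most recent documents in the index
resp = requests.get(
    f"{ENDPOINT}/{INDEX}/_search",
    auth=AUTH,
    json={"size": 5, "query": {"match_all": {}}},
    timeout=10,
)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])  # each document is one forwarded nginx log record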
Wrap-up Since Fluent Bit is well known as the log collector most often used with containers, I ran this demo integrating AWS OpenSearch with an EKS cluster as a way to review EKS. EKS really does feel like it is operated through manifest files, and looking at it again it was just a pile of YAML, which was tedious to read. So I looked around for an easier way, and it turns out there are options like Helm these days; if you're reading this, it's worth looking into how to use Helm. I also looked into something I was personally curious about: why Logstash isn't a good fit for containers. Part of it is its heavy footprint, but features at its core, such as the persistent queue, are apparently hard to use in a stateless container environment. What I took away from building container logging this time is that it's best to read the official documentation before doing anything. Reading through the files one by one, I did my share of trial and error for the first time in a while, and my eyes hurt. I hope you take that as a hint and spare your own eyes.

  • AWS IAM for Beginners: Master the Core Concepts in 10 Minutes (1/2)

    AWS IAM for Beginners: Master the Core Concepts in 10 Minutes (1/2) - Exploring IAM Components: Users, Groups, Roles, and Policies Written by Hyojung Yoon Hello everyone! It's been a while since I've written an AWS blog. For those new to AWS, IAM might seem a bit complex at first. I remember it took me quite some time to fully grasp IAM myself. However, IAM is a crucial service that forms the foundation of security and access control in the AWS environment. It's an essential service that you must understand to effectively manage your AWS resources. In this blog post, I'll explain the basic concepts and importance of IAM in an easy-to-understand manner. Context What is AWS IAM? IAM Components Users Groups Policies Roles IAM User Management Things to consider when creating an IAM user Access key and secret access key How to set user permissions Direct policy attachment vs Group policy attachment Principle of least privilege IAM Policies Deep Dive Policy types Writing policies and important considerations Policy components Policy writing considerations Leveraging IAM Roles IAM role use cases and examples Cross-account access management Comparing IAM roles and policies Conclusion What is AWS IAM? AWS IAM (Identity and Access Management) is a service that allows you to securely control access to AWS services and resources. Simply put, it acts as the gatekeeper of your AWS environment, managing who can do what. With IAM, you can create and manage users, groups, and roles, and finely control access permissions to AWS resources. Using IAM allows you to: Prevent security incidents by granting only the minimum necessary permissions. Easily create and manage access control policies for various scenarios. Optimize costs by restricting unnecessary resource usage. Simplify compliance with corporate security policies and regulatory requirements. Key Features of IAM Shared access to your AWS accoun t: Grant others permission to manage AWS resources without sharing your credentials. Granular permissions : Give different people different levels of access to specific resources. Multi-factor authentication (MFA) : Add an extra layer of security to prevent unauthorized access. Identity Federation : Provide temporary access to AWS resources for users who already have passwords elsewhere, such as in your corporate network or with an internet identity provider. IAM Components IAM consists of four main components: Users, Groups, Roles, and Policies . Policies define detailed permission settings, which are then attached to role, users, or groups. These roles are in turn linked to users or AWS resources to set their permissions. 1. Users IAM users represent individuals or services that use AWS. After creating an IAM user, you can generate the following credentials: A password for logging into the management console Access credentials (access key ID and secret access key) These credentials are different from root user security credentials and are defined for only one AWS account. 2. Groups Groups are collections of users with the same permission s. For example, you can create groups like 'Development Team' or 'Operations Team'. Groups are useful when you need to assign common permissions to multiple users, simplifying the process of managing permissions for several users. Note that IAM groups, unlike IAM users or roles, don't have their own credentials. 3. Policies Policies are JSON documents that define permissions in AWS. They specify rules to allow or deny access to specific resources. 
These policies are attached to IAM users, groups, and roles. The main elements of a policy are 'Effect', 'Action', and 'Resource'. The basic JSON policy structure is as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-bucket/*"
        }
    ]
}

Effect Can be either "Allow" or "Deny". It determines whether the statement allows or denies specific actions. Action Defines the actions that can be performed. For example: "s3:GetObject" You can specify multiple actions using wildcards Resource Specifies the AWS resources to which the actions apply. It uses Amazon Resource Names (ARNs) to identify resources. You can specify multiple resources at once using wildcards. * A Quick Note on Permissions : Permissions are the core concept for controlling access to AWS services. They allow you to specify the level of access for users or services very precisely. Characteristics of Permissions: Granularity : Access rights can be specified down to a very detailed level. Flexibility : Permissions can be adjusted to suit various scenarios. Enhanced Security : By granting only necessary permissions, you can improve overall security. Policies are like the rulebooks of IAM. They're essentially JSON documents that bundle together sets of permissions. In other words, a policy is a document that defines what a user can and can't do in your AWS account. 4. Roles IAM roles are a feature that allows you to grant temporary permissions to users or services. IAM roles enable flexible management of permissions needed for specific tasks. Roles use temporary credentials to grant permissions, which is more secure as it doesn't require long-term credentials like access keys. IAM User Management IAM users represent entities that interact with AWS, which could be actual people or applications and services. Each user has unique credentials for accessing AWS resources. 1. Things to consider when creating an IAM user Access Type : AWS Management Console access, programmatic access, or both Permission Scope : Minimum necessary permissions for the user's tasks Group Membership : Manage users with similar roles in groups Password Policy : Apply policies that enforce strong password usage 2. Access keys and Secret access keys These are credentials that allow users to access AWS programmatically. They must be stored securely and immediately changed or deactivated if compromised. 3. How to set user permissions 1) Direct Policy Attachment vs. Group Policy Attachment Direct Policy Attachment : Used when individual users need specific permissions. Group Policy Attachment : Applies permissions to multiple users with similar roles simultaneously. 2) Principle of least privilege This principle involves granting users only the minimum permissions necessary for their tasks, preventing security incidents caused by unnecessary permission grants. IAM Policies Deep Dive 1. Policy Types AWS managed policies : Pre-defined policies provided by AWS for common use cases. Customer managed policies : Policies created and managed by users for specific requirements. Inline policies : Policies directly included in a specific user, group, or role.
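To make the "customer managed policy" idea concrete, here is a small boto3 sketch that registers a policy like the S3 example above and attaches it to a group; the policy name, group name, and bucket are illustrative assumptions rather than values from this guide.

import boto3, json

iam = boto3.client("iam")

# A customer managed policy mirroring the earlier S3 example (names are placeholders)
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

created = iam.create_policy(
    PolicyName="ExampleBucketReadWrite",
    PolicyDocument=json.dumps(policy_document),
)

# Attach it to a group so every member inherits the permissions
iam.attach_group_policy(
    GroupName="DevTeamA",
    PolicyArn=created["Policy"]["Arn"],
)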
2. Policy writing considerations

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {
                "IpAddress": { "aws:SourceIp": "192.0.2.0/24" }
            }
        }
    ]
}

1) Policy Components Version : Specifies the policy document version. It's recommended to use the latest version '2012-10-17'. Statement : This is the container that holds the main elements of a policy document. It can include one or more statements. Effect : This defines the result of the policy. It can have two values: 'Allow' or 'Deny'. In our example, we're allowing access ('Allow'). Action : This specifies what actions the policy allows or denies. For instance, in our example, we're allowing the S3 object retrieval action ('s3:GetObject'). Resource : This identifies the AWS resources to which the policy applies. In our example, we're targeting all objects ('*') in the 'example-bucket'. Condition : This defines the conditions under which the policy applies. Our example uses the 'aws:SourceIp' condition to allow access only from a specific IP address range, '192.0.2.0/24'. 2) Policy Writing Considerations Adhere to the Principle of Least Privilege : Enhance security by granting only the minimum necessary permissions. Use wildcards (*) cautiously : Use carefully to prevent granting excessive permissions. Implement explicit denials when necessary (Deny) : Takes precedence over Allow, and should be set explicitly when needed. Utilizing IAM Roles 1. Concept and Use Cases of IAM Roles IAM roles are tools for setting specific permissions for IAM users. They are used for authentication and granting temporary access rights to specific AWS resources within an AWS account. For example, you can set up a role to allow an EC2 instance to access an S3 bucket. Both roles and users are AWS credentials that determine what actions can and cannot be performed in AWS according to permission policies. However, roles are not associated with a specific individual and can be assumed by anyone who needs that role. Also, roles don't have standard long-term credentials like passwords or access keys; instead, they provide temporary security credentials. IAM role use cases: EC2 instance accessing S3 bucket : You can grant an IAM role to an EC2 instance to allow it to access an S3 bucket. This allows applications running on the EC2 instance to securely access the S3 bucket. Cross-account access management : You can securely share resources across multiple AWS accounts. For example, you can use roles to safely allow users in a development account to access resources in a production account. Using AWS services : Lambda functions or other AWS services can use roles to access other AWS resources. Providing temporary credentials : You can set up temporary credentials for specific tasks. This enhances security and reduces the risk of long-term credential exposure. 2. Cross-Account Access Management You can use IAM roles to securely share resources across multiple AWS accounts. For example, you can use roles to safely allow users in a development account to access resources in a production account.
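As a concrete illustration of the EC2-to-S3 use case above, the following boto3 sketch creates a role that EC2 can assume, attaches the AWS managed AmazonS3ReadOnlyAccess policy, and wraps it in an instance profile; the role and profile names are assumptions made for the example.

import boto3, json

iam = boto3.client("iam")

# Trust policy: only the EC2 service may assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="Ec2S3ReadOnlyRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="Ec2S3ReadOnlyRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# EC2 picks up the role through an instance profile
iam.create_instance_profile(InstanceProfileName="Ec2S3ReadOnlyProfile")
iam.add_role_to_instance_profile(
    InstanceProfileName="Ec2S3ReadOnlyProfile",
    RoleName="Ec2S3ReadOnlyRole",
)

An application on an instance launched with this profile can then call S3 through the SDK without any long-term access keys.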
3. Comparing IAM Roles and Policies
Target - IAM Role: all specified targets / IAM Policy: one IAM user, separated by account
Credential Type - IAM Role: temporary authorization method / IAM Policy: long-term or permanent authorization method
Permission Acquisition Method - IAM Role: access allowed only during the specified time / IAM Policy: access allowed as long as the policy is attached
Use Cases - IAM Role: EC2 instances, Lambda functions, temporary permission management / IAM Policy: used for specific resources with detailed access control
Components - IAM Role: trusted entity, permission policy / IAM Policy: Effect, Action, Resource, Condition
Main Purpose - IAM Role: enhancing security and flexible permission management / IAM Policy: detailed access control
Conclusion Through this AWS IAM guide, we've explored the core of IAM. By understanding and applying each component of IAM - users, groups, roles, and policies - in real-world environments, you can manage your AWS environment more securely and efficiently. I hope this guide serves as a solid first step in strengthening your AWS security. In the upcoming practical IAM guide, we'll cover IAM setup and hands-on exercises focusing on real-world cases, so stay tuned! Links What is IAM? - AWS Identity and Access Management Policies and permissions in IAM - AWS Identity and Access Management IAM roles - AWS Identity and Access Management IAM users - AWS Identity and Access Management Using IAM roles - AWS Identity and Access Management Managed policies and inline policies - AWS Identity and Access Management

  • Are AWS Certifications worth it? : AWS SA-Professional 4

    Are AWS Certifications worth it? : AWS SA-Professional (SAP) Certification 4 Written by Minhyeock Cha It's been a while since I last tackled SAP exam questions. With my certification renewal date approaching, I thought it would be a good time to return and share some tips and solutions for the exams I've taken so far. This post will focus entirely on solving certification exam questions. Although it may not directly address the question "Is it really useful?" as the blog title suggests, I'll make sure to include plenty of valuable tips to help you succeed. Question 1. Your company is storing millions of sensitive transactions across thousands of 100-GB files that must be encrypted in transit and at rest. Analysts concurrently depend on subsets of files, which can consume up to 5 TB of space, to generate simulations that can be used to steer business decisions. You are required to design an AWS solution that can cost effectively accommodate the long-term storage and in-flight subsets of data. Ⓐ Use Amazon Simple Storage Service (S3) with server-side encryption, and run simulations on subsets in ephemeral drives on Amazon EC2. Ⓑ Use Amazon S3 with server-side encryption, and run simulations on subsets in-memory on Amazon EC2. Ⓒ Use HDFS on Amazon EMR, and run simulations on subsets in ephemeral drives on Amazon EC2. Ⓓ Use HDFS on Amazon Elastic MapReduce (EMR), and run simulations on subsets in-memory on Amazon Elastic Compute Cloud (EC2). Ⓔ Store the full data set in encrypted Amazon Elastic Block Store (EBS) volumes, and regularly capture snapshots that can be cloned to EC2 workstations. Solutions Since it's been a while, I decided to start with a simple and straightforward problem. This question can be easily solved with basic AWS knowledge and an understanding of storage concepts (e.g., storing more data but with slower access, storing less data but with faster access). Key points from the question are: Daily storage of 100GB of transactions Up to 5TB of storage space Long-term storage Cost-effective solution We can break down the solution based on these four key points. Using the storage concepts mentioned earlier, we can evaluate the options provided: S3, HDFS on EMR, and EBS. Given that S3 is significantly cheaper compared to other storage options, anyone familiar with AWS would know to eliminate the other choices. Additionally, long-term storage correlates directly with cost efficiency, making S3 the obvious answer. 💡 HDFS incurs additional costs related to usage and cluster maintenance, making S3 the obvious answer. Now, we're left with options A and B. The idea of running simulations using EC2 memory suggests that... If you're confident in covering server costs, option B might be viable. However, for those looking to optimize costs, option A is the recommended choice. Answer: A Question 2. You are looking to migrate your Development (Dev) and Test environments to AWS. You have decided to use separate AWS accounts to host each environment. You plan to link each accounts bill to a Master AWS account using Consolidated Billing. To make sure you keep within budget you would like to implement a way for administrators in the Master account to have access to stop, delete and/or terminate resources in both the Dev and Test accounts. Identify which option will allow you to achieve this goal. Ⓐ Create IAM users in the Master account with full Admin permissions. 
Create cross-account roles in the Dev and Test accounts that grant the Master account access to the resources in the account by inheriting permissions from the Master account. Ⓑ Create IAM users and a cross-account role in the Master account that grants full Admin permissions to the Dev and Test accounts. Ⓒ Create IAM users in the Master account. Create cross-account roles in the Dev and Test accounts that have full Admin permissions and grant the Master account access. Ⓓ Link the accounts using Consolidated Billing. This will give IAM users in the Master account access to resources in the Dev and Test accounts Solutions The question requires a consolidated billing architecture post-migration under the assumption of using separate Dev and Test accounts. The accounts are divided as follows but are structured to allow consolidated billing through AWS Organizations, enabling unified billing. The key challenge here is to prevent budget overruns by allowing the administrator of the master account to implement a way to stop, delete, and/or terminate resources in both the Dev and Test accounts. This involves permission assignment, thus eliminating option D immediately. The core of this problem is not just about viewing the organization structure but focusing on IAM role switching. Resource interference from the master account to other accounts is not a native feature of the organization service, necessitating the implementation of IAM role switching. Steps to Solve: Create an IAM user in the master account. Create an Admin role in both the Dev and Test accounts. By setting up this structure, the master account can control resources in other accounts. Evaluation of Options: Option A:  Incorrect, because creating an Admin policy in the master account does not grant control over other accounts' resources. Option B:  Incorrect, because it creates cross-account roles without allowing access from the master account. Option C:  Correct, as it involves setting up the necessary roles and permissions to allow resource control from the master account. Answer: C Question 3 A company has a web application that allows users to upload short videos. The videos are stored on Amazon EBS volumes and analyzed by custom recognition software for categorization. The website contains static content that has variable traffic with peaks in certain months. The architecture consists of Amazon EC2 instances running in an Auto Scaling group for the web application and EC2 instances running in an Auto Scaling group to process an Amazon SQS-queue. The company wants to re-architect the application to reduce operational overhead using AWS managed services where possible and remove dependencies on third-party software. Which solution meets these requirements? Ⓐ Use Amazon ECS containers for the web application and Spot instances for the Scaling group that processes the SQS queue. Replace the custom software with Amazon Rekognition to categorize the videos. Ⓑ Store the uploaded videos in Amazon EFS and mount the file system to the EC2 instances for the web application. Process the SQS queue with an AWS Lambda function that calls the Amazon Rekognition API to categorize the videos. Ⓒ Host the web application in Amazon S3. Store the uploaded videos in Amazon S3. Use S3 event notification to publish events to the SQS queue. Process the SQS queue with an AWS Lambda function that call the Amazon Rekognition API to categorize the videos. 
Ⓓ Use AWS Elastic Beanstalk to launch EC2 instances in an Auto Scaling group for the application and launch a worker environment to process the SQS queue. Replace the custom software with Amazon Rekognition to categorize the videos. Solutions This question involves modernizing an existing application currently operated on EC2 instances. Key Points: The site has variable traffic with peaks during certain months and serves "static content." Minimize overhead and remove dependencies. Combine the above key points to identify the correct solution among the provided options. When thinking of static content, you might immediately consider the classic combination of CloudFront and S3. If you thought of this and chose option C, you can move on to the next question. Answer: C However, to provide more detail and context, let's delve into why option C is the correct choice by exploring its architecture and setup. For static content web app hosting and video uploads, S3 buckets are used. 💡 S3 supports web hosting, which is appropriate for static content as mentioned in the question. However, for scenarios requiring WAS (Web Application Server) operations, this wouldn't be the most efficient method. Using CloudFront in front would enhance performance, but since it's not listed in the options, we'll exclude it for now. S3 can trigger events to specific targets. In this scenario, it triggers an SQS queue when new videos are uploaded to the bucket. The SQS queue is then processed by a Lambda function. This Lambda function uses Rekognition for video classification, organizing the videos back into the S3 bucket. This problem can be a bit tricky because all the provided services are designed to be operationally feasible, making it difficult to identify the incorrect option. Therefore, it’s a good idea to clearly understand the core points of the problem and review the options again. The key points of the problem are: The site has variable traffic with peaks during certain months and serves "static content." Minimize overhead and remove dependencies. While static content can be managed easily, minimizing overhead and dependencies requires an architecture that involves minimal manual intervention, utilizing AWS-managed services. Evaluating Other Options Option A:  Uses ECS, which involves managing containers and underlying EC2 instances, thus requiring more human resources. Option B:  Relies on EC2, which again involves managing the servers directly, leading to higher overhead. Option D:  Utilizes Elastic Beanstalk, which abstracts some management but still requires handling of the compute resources. In contrast, option C utilizes fully managed AWS services, aligning with the problem's core requirement of minimizing overhead and dependencies. Conclusion I hope the AWS SA certification questions we covered today have been helpful to you. If you have any questions about the solutions, notice any errors, or have additional queries, please feel free to contact us anytime at partner@smileshark.kr .

  • BCM Educational Group CaseStudy

    About the Customer BCM Educational Group has been leading the industry for 41 years with the goal of “providing the best educational experience that allows you to speak English continuously, little by little every day, anytime, anywhere”. Through FIT-Korean English management and 'BCM U-Phone' service, it was ranked first in consumer preference in the phone English and English conversation categories for six consecutive years, and in the case of classes by foreign instructors, 99.4% satisfaction was achieved. In particular, BCM U-Phone supports an environment where students can learn English conversation anytime, anywhere, even in a non-face-to-face environment, and is leading the domestic phone English market by proposing a variety of curricula according to the purpose of learning. Customer Challenge BCM has been continuously expanding its business scale over the past few years, showing consistent growth. Along with this growth, the usage of cloud services has significantly increased. To improve infrastructure cost efficiency while maintaining business growth momentum, BCM aims to carefully review the current cloud infrastructure and reduce operating costs by removing or optimizing unnecessary resources. Simultaneously, they are considering reviewing business processes to increase efficiency and modernizing the technology stack to introduce more cost-effective solutions. They plan to minimize the license costs of existing Microsoft workloads and prepare for modernization without issues in the production environment. Needed solutions for management and cost issues of existing MSSQL Requested to change OS from existing Windows-based servers to Linux Needed solutions to minimize service impact and resolve downtime issues during migration Required redundancy to minimize downtime and enable automatic service recovery Automated CI/CD pipelines for web services and back-office programs Needed solutions for potential data loss Required real-time performance monitoring and alert settings Desired consistently better performance compared to the existing environment even after modernization Proposed Solution Recommended cost reduction through modernization from existing MSSQL Supported OS change through EBS volume snapshots and configure basic settings through Systems Manager after the change Proposed monitoring and migration using DMS to ensure no impact on service operations Secured high availability by configuring EC2 and databases with multi-AZ (Availability Zone) deployment Automated source code build and deployment through AWS CodePipeline Recommended enabling automatic daily snapshot backups for databases Detected resource usage and configuration changes through CloudWatch and Config, and send real-time notifications via SNS Configured AWS fully managed relational database engine Aurora to improve performance and minimize downtime AWS Tools and technology Computing AWS EC2 AWS Lambda Storage Amazon Simple Storage Service Database Amazon Aurora MySQL Amazon ElastiCache for Redis Network AWS VPC AWS NAT Gateway AWS Internet Gateway Amazon Route53 Amazon CloudFront AWS Elastic Load Balancer Production AWS CodePipeline AWS CodeCommit AWS CodeBuild AWS CodeDeploy AWS Chime AWS Polly Monitoring AWS CloudWatch AWS CloudTrail AWS Config Security AWS WAF AWS KMS AWS Certificate Manager Outcomes of Project & Success Metrics Achieved over 40% cost reduction and eliminated license fees through modernization to Aurora Accomplished data protection, automated initial configuration, system 
standardization, operational efficiency, enhanced security, and compliance simultaneously through EBS snapshots and AWS Systems Manager Completed zero-downtime migration to Aurora using DMS Achieved high availability with Multi-AZ configuration, enabling continuous service even in the event of failures Completed production deployment with minimal manpower through automated build and deployment of source code Minimized data loss by enabling automatic daily snapshot backups Implemented a more stable service through real-time alert notifications Improved performance by more than 5 times and reduced downtime by over 50% with the Aurora MySQL configuration Lessons Learned Achieved cost improvements while eliminating licensing costs through modernization, resulting in improved performance Designed the architecture for elasticity and scalability in anticipation of future growth in users and capacity Increased data protection and operational efficiency through automatic backups

  • We are living in the serverless era: a server-free pipeline with “Kinesis”

    We are living in the serverless era: a server-free pipeline with “Kinesis” Written by Minhyeok Cha I originally meant to write this right away, but company schedules got in the way and quite a bit of time passed. It's been about a month, but let me pick up where I left off with an explanation of Kinesis Data Streams and examples of combining it with other services. As mentioned in the previous blog post, in Kafka a partition count, once increased, cannot be reduced again. In contrast, Kinesis shards can be scaled out and back in freely according to the volume of incoming data, which makes operations and cost management easier. 💡 If you're new to data streams, I recommend starting with Kinesis, which is easy to set up, configure, and operate. Creating a Kinesis Data Stream resource itself is simple, but getting data in and out requires the SDK or API, so I'll explain it in an approachable way through a few use cases and a simple demo using a tool provided for testing. Contents How Kinesis Data Streams works internally Partition keys Sequence numbers When a consumer reads data (SDK) Kinesis use cases Using Kinesis Firehose With Kinesis Data Streams on both ends A quick taste of Kinesis (demo) Wrap-up How Kinesis Data Streams works internally Like MSK, which we covered before, AWS Kinesis Data Streams supports live streaming delivery; it is a fully managed service for streaming real-time data into the AWS cloud or handling real-time video processing. When a producer sends data, a partition key is attached to the record and run through a hash function as shown above, and a shard ID and sequence number are returned. Partition key The partition key is used to assign a record to a particular shard. Kinesis hashes the partition key to decide which shard the record goes to. Records with the same partition key are sent to the same shard, which guarantees that data associated with a given partition key ends up on the same shard. Sequence number A unique number automatically assigned by Kinesis when each record is added to the data stream, serving the following purposes. Ordering guarantee: within the same shard, records are read in the order they were inserted. A consumer can use sequence numbers to track record order and, when needed, read records after a specific sequence number. Duplicate prevention: sequence numbers can be used to keep the same record from being processed multiple times. For example, a consumer can store the sequence number of the last record it processed and, on the next run, read only records newer than that value. When a consumer reads data (SDK) A consumer streams data sequentially from a specific shard. This can be controlled through the consumer SDK; see the link for details. AT_SEQUENCE_NUMBER : start streaming from the position marked by the sequence number specified in the field. AFTER_SEQUENCE_NUMBER : start streaming right after the position marked by the sequence number specified in the field. AT_TIMESTAMP : start streaming from the position marked by the timestamp specified in the field. TRIM_HORIZON : start streaming from the last untrimmed record in the shard (the oldest data record in the shard). LATEST : start streaming just after the most recent record in the shard, so that you always read the newest data. ※ This is essential reading for anyone coordinating consumers with the SDK. Kinesis provides agents, libraries, and easy ways to build data pipeline structures. Thanks to this, it can also be used conveniently from the console, making configuration easier than working with the low-level API. Kinesis use cases As some of you will already know, AWS Kinesis comes in three flavors: not just Data Streams but also Firehose and Apache Flink. Kinesis Firehose excels at data delivery; Kinesis Apache Flink is for building and running streaming applications. Firehose is simple, but Apache Flink goes all the way down to code using the Apache Zeppelin dashboard, so explaining it would take a while and it isn't the main service of this post; I'll skip it. With that in mind, I'll walk through two use cases along with their architectures. Using Kinesis Firehose Personally this is the setup I use most often, and I find it the easiest to configure. Firehose can designate its own source and destination, so wiring things up is simple. On top of that, with Firehose the consumer doesn't have to be assigned shards directly; you can connect the workflow at the stream level, which is very convenient. As explained above, each shard holds different data, so when pulling data with the SDK or API you have to either specify a shard_Id or loop over as many shards as exist. With Kinesis Data Streams on both ends In this pattern, Apache Flink writes data back into a Data Stream through the KPL. It isn't as simple as the Firehose case above, but reconnecting to a Data Stream makes it easy to build pipelines that create touchpoints with other AWS services. And, as in the case above, you still get the advantage of fanning out to different services through Firehose.
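Before moving on to the console demo, here is a minimal boto3 sketch of the SDK-level flow described above (a producer put, then a consumer reading one shard through a shard iterator); the stream name and partition key are placeholders, and a real consumer would normally use the KCL or iterate over every shard.

import boto3

kinesis = boto3.client("kinesis")
STREAM = "demo-stream"  # placeholder stream name

# Producer: the partition key decides which shard the record lands on
kinesis.put_record(StreamName=STREAM, Data=b'{"msg": "hello"}', PartitionKey="user-1")

# Consumer: pick one shard and read it from the oldest untrimmed record
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # see the iterator types listed above
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=10)
for record in records["Records"]:
    print(record["SequenceNumber"], record["Data"])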
A quick taste of Kinesis (demo) In the demo I'll walk through the Kinesis Firehose use case and show how to crawl the data with a Glue Crawler and query it with Athena. Don't worry about the red service icon at the front; some of these friends are created automatically, so it's easy to finish. 1. First, create the Kinesis duo. Set the Data Stream to on-demand mode; it really is created right away without filling anything in. For Firehose, just configure the connections as in the architecture above and select S3 at the bottom. 2. Amazon Kinesis Data Generator A Kinesis you may not have seen before appears here, but don't be alarmed. The Kinesis Data Generator is a tool that makes it easy to push data into Kinesis. Let's keep going with the demo using it. Go to the link provided, install and run the CloudFormation template, set up the Cognito account, then open the link shown in the Output, log in, and continue. ※ Go to that link, download the CloudFormation template, and proceed with it. After following everything in the link and sending records, you can check the result in the browser developer tools as shown. Also check that the data has arrived properly in Kinesis and S3. 3. Glue setup and the crawled database Setting up Glue and the crawler is simple. Connect S3, create the database that will hold the crawled data, and press the "Run Crawler" button at the top of the screenshot to crawl the data in S3. Querying with Athena, you can see that the data has accumulated nicely in the Glue-crawled database. Wrap-up In a previous blog post I said the following: "In Kinesis Data Streams you can adjust the number of shards." Records arriving at the added shards are assigned to one of the new shards, and this is determined by the hash value of the partition key. When I first got into AWS, Kinesis was set up by adjusting shards manually and then having Lambda handle the auto scaling. But now I've learned that Amazon Kinesis Data Streams on-demand can scale automatically. Unlike MSK, this should help a lot with cost savings and operations. For the demo, too, I initially planned to insert data using the SDK or an agent, but after searching around I found a tool called the Amazon Kinesis Data Generator and was able to test easily. A personal note to add: streaming data is usually handled with MSK and Kafka, so it had been a long time since I used Kinesis, maybe once or twice right before I got into AWS. When you come back to the AWS console after a while, the UI has usually been updated several times, so even digging through old memories is hard. It seems like a good idea to look around once in a while, as if browsing the web, just to see what has changed.

  • Dropping a nuke on an AWS account: how to use AWS-nuke

    Dropping a nuke on an AWS account: how to use AWS-nuke Written by Minhyeok Cha Recently, perhaps because new SAs joined SmileShark, resource usage on our internal test account shot up dramatically. Of course, the test account is used without any restrictions for the SAs' testing and R&D, but after the operations team pointed out higher-than-expected costs, we went looking for a way to manage them. We found a tool called AWS Nuke, and in this post I'd like to share what we learned while using it. One colleague nearly ran it against Prod resources, so never, ever use it on a Prod account! What is AWS Nuke? Precautions Installation Usage Setting an account alias plan delete A simple test Tips from the AWS-nuke configuration file SmileShark currently uses Running AWS-nuke automatically Wrap-up What is AWS Nuke? AWS Nuke scans an AWS account for resources that can be deleted. In other words, it is a tool that wipes out every resource the user created directly, excluding the default resources managed by AWS. Because of that, there are things you absolutely must be careful about before using AWS Nuke. ⚠️ Precautions ⚠️ By default, AWS Nuke only deletes the user-created resources mentioned above. To actually delete resources, you must add the "--no-dry-run" flag. Example: "aws-nuke -c config.yml --no-dry-run" AWS Nuke asks you to type the account alias twice to confirm deletion: once right after it starts, and once more after it has listed the deletable resources. You must create an alias on the account, to prevent it from showing only account IDs that a person could easily ignore. Execution is refused if the account alias contains the string "prod". The configuration file provides a blocklist field; if the account ID you are trying to nuke is part of this blocklist, AWS Nuke aborts. As a rule, it's a good idea to add every production account to the blocklist. The configuration file also contains per-account settings. Specify the configuration file carefully so that you don't accidentally delete an arbitrary account. It's best to keep a single configuration file and add it to a central repository. Installation

# Install on macOS
brew install aws-nuke
# The example config.yaml can then be found at the following path
cat /opt/local/share/aws-nuke/examples/example.yaml

Usage The contents of the example.yaml file pulled above are as follows.

---
regions:                 # regions where nuke runs
  - "global"             # for global resources only
  - "eu-west-1"
account-blocklist:       # blocklisted AWS account IDs
  - 1234567890
# resources to target or exclude
resource-types:          # targets - include / excludes - exclude
  targets:
    - S3Bucket
  excludes:
    - IAMUser
    - IAMUserPolicyAttachment
    - IAMUserAccessKey
# AWS account IDs to nuke
accounts:
  555133742:
    filters:             # resources matched here are kept out of deletion
      IAMUser:
        - "admin"
      IAMUserPolicyAttachment:
        - property: RoleName
          value: "admin"
      IAMUserAccessKey:
        - property: UserName
          value: "admin"
      S3Bucket:
        - "s3://my-bucket"

※ The original YAML file contains more than this, but I trimmed it down a bit for a simple explanation. Setting an account alias

aws iam create-account-alias --account-alias testcha

plan

aws-nuke -c config.yaml

delete

aws-nuke -c config.yaml --no-dry-run

※ When running it, you need to create and use your own config.yaml, not the example YAML above! A simple test ※ I ran a simple test: put one EC2 instance in each of two accounts and used aws-nuke to check whether they actually get deleted. ※ Even if you put two accounts in the YAML file, it seems execution only applies to the account selected as the default in your profile. Account 1 Account 2 ※ I confirmed that the resources in each account were deleted. Tips from the AWS-nuke configuration file SmileShark currently uses Because our costs were high, we split resources into items that incur cost (targets) and items that don't incur cost or that we use regularly (excludes). ※ Folding in SmileShark's need for cost management, it comes out as follows. Enter the 12-digit AWS account ID of the deletion target under accounts. The resources written below that are the ones that escape aws-nuke; in other words, the resources pinpointed by the filters are excluded and everything else is deleted. Taking the screenshot above as an example, putting an EC2 IP address directly into value so that the instance escapes deletion is one such case. Running AWS-nuke automatically SmileShark runs three accounts in rotation: a short-term test account, a long-term test account, and an internal service account, and the short-term test account is always listed under accounts in config.yaml. Running that YAML file manually every time was a chore, so I searched Google and found the following setup. See the following GitHub repo: aws-nuke-account-cleanser0example AWS Step Functions runs at a time preconfigured in EventBridge; the triggered CodeBuild fetches the config.yaml file from an S3 bucket, then Python code opens the fetched config.yaml, reads its configuration, and adds the collected resources and exclusion list to the existing configuration. In the code that runs afterward, CodeBuild takes on the specified IAM role and uses aws-nuke to start deleting resources in the regions and accounts specified in the config. This workflow is nice to have automated, but one downside was that the configuration file changes every time the code runs, so checking the configuration file sitting in S3 was a hassle. Wrap-up It turned out to be a very handy tool to use once in a while when your personal account or your company's test account is racking up costs and you want to clear out resources. Setting up the configuration file isn't hard, and if you accidentally run it against a Prod environment there are several layers of safeguards, so a certain level of safety is guaranteed. If you want to manage several accounts in one place, it's also worth taking a look at "presets", which are used alongside "accounts".
If you want to run AWS-nuke automatically, there is the Step Functions and CodeBuild setup used above, but pointing a simple crontab at the config file and running it that way can be another option.
